athina-evals by athina-ai

Python SDK for LLM response evaluation

Created 2 years ago

297 stars

Top 89.5% on SourcePulse

1 Expert Loves This Project

ebursztein (Cybersecurity Lead at Google DeepMind)

Project Summary

Athina-evals provides a Python SDK for evaluating Large Language Model (LLM) responses, offering over 50 preset evaluations and support for custom ones. It's designed for AI teams focused on observability and experimentation, serving as a companion to the Athina IDE for prototyping, running experiments, and comparing datasets.

How It Works

The SDK runs evaluations programmatically, and the results are viewed and managed in the Athina IDE. Keeping both in one workflow supports side-by-side dataset comparison and experiment tracking.

Quick Start & Requirements

Install: pip install athina-evals
For CodeExecutionV2 evaluations, install e2b-code-interpreter.
Requires an Athina API key, obtainable from https://app.athina.ai.
Quick start guide available in a notebook: https://github.com/athina-ai/athina-evals/blob/main/notebooks/quickstart.ipynb
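Before opening the notebook, it can help to confirm the prerequisites above are in place. The following is a minimal sketch, not part of the SDK: check_setup is a hypothetical helper, the import names athina and e2b_code_interpreter are assumptions about how the two packages are imported, and the ATHINA_API_KEY environment variable name is an assumption (the README only says an API key is required).

```python
import importlib.util
import os


def check_setup(env=None, modules=("athina", "e2b_code_interpreter")):
    """Report which prerequisites appear to be in place.

    The module names and the ATHINA_API_KEY variable are assumptions
    made for illustration; adjust them to match your installation.
    """
    env = os.environ if env is None else env
    # find_spec returns None when a top-level module is not installed.
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Treat an empty or whitespace-only value as "not set".
    report["api_key_set"] = bool(env.get("ATHINA_API_KEY", "").strip())
    return report


if __name__ == "__main__":
    for item, ok in check_setup().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

The e2b_code_interpreter entry only matters if you plan to run CodeExecutionV2 evaluations.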

Highlighted Details

Over 50 preset evaluations available.
Supports custom evaluation creation.
Integrates with Athina IDE for enhanced workflow.
Enables side-by-side dataset comparison.
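To make "custom evaluation" and "side-by-side dataset comparison" concrete, here is a minimal sketch in plain Python. It does not use the SDK's classes or reproduce its API: EvalResult, response_mentions_keywords, and pass_rate are hypothetical names, and the two small datasets are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of one evaluation on one response (hypothetical structure)."""
    name: str
    passed: bool
    reason: str


def response_mentions_keywords(response: str, keywords: list[str]) -> EvalResult:
    """Custom check: pass only if every keyword appears in the response."""
    missing = [k for k in keywords if k.lower() not in response.lower()]
    return EvalResult(
        name="response_mentions_keywords",
        passed=not missing,
        reason="all keywords found" if not missing else f"missing: {', '.join(missing)}",
    )


def pass_rate(rows: list[dict]) -> float:
    """Fraction of rows that pass the check; 0.0 for an empty dataset."""
    results = [response_mentions_keywords(r["response"], r["keywords"]) for r in rows]
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Two made-up datasets: the same prompts answered by two prompt or model variants.
baseline = [
    {"response": "Paris is the capital of France.", "keywords": ["Paris"]},
    {"response": "I am not sure.", "keywords": ["Berlin"]},
]
candidate = [
    {"response": "Paris is the capital of France.", "keywords": ["Paris"]},
    {"response": "Berlin is the capital of Germany.", "keywords": ["Berlin"]},
]

print(f"baseline:  {pass_rate(baseline):.0%}")
print(f"candidate: {pass_rate(candidate):.0%}")
```

Running the same check over both datasets and comparing the pass rates is the basic shape of a side-by-side comparison; the quickstart notebook shows how the SDK itself does it.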

Maintenance & Community

No specific contributor or community details are provided in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

The README does not detail any limitations or caveats.

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

2 stars in the last 30 days

Explore Similar Projects

evalyn by shihongDev

GenAI application evaluation framework

Created 2 months ago

Updated 3 days ago

Starred by

Shyamal Anadkat (Research Scientist at OpenAI), Travis Fischer (Founder of Agentic), and 2 more.

autoevals by braintrustdata

Evaluation tool for AI model outputs using automatic methods

Created 2 years ago

Updated 1 day ago

fmeval by aws

Evaluate foundation models for various NLP tasks

Created 2 years ago

Updated 6 months ago

PandaLM by WeOpenML

LLM evaluation benchmark for reproducible, automated assessment

Created 2 years ago

Updated 1 year ago

Starred by

Wing Lian (Founder of Axolotl AI) and Maxime Labonne (Head of Post-Training at Liquid AI).

prometheus-eval by prometheus-eval

LLM evaluation framework using open LLMs

Created 1 year ago

Updated 10 months ago

Starred by

Marc Klingen (Cofounder of Langfuse), Vasek Mlejnsky (Cofounder of E2B), and 1 more.

openevals by langchain-ai

Evaluation toolkit for LLM apps, like tests for traditional software

Created 1 year ago

Updated 19 hours ago

Starred by

Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

LLM evaluation framework

Created 2 years ago

Updated 5 days ago

Starred by

Elie Bursztein (Cybersecurity Lead at Google DeepMind), Bryan Helmig (Cofounder of Zapier), and 7 more.

ChainForge by ianarawjo

Visual environment for LLM prompt battle-testing

Created 2 years ago

Updated 1 month ago

Starred by

Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 8 more.

lighteval by huggingface

LLM evaluation toolkit for multiple backends

Created 2 years ago

Updated 5 days ago

Starred by

Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Jared Palmer (SVP at GitHub; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

human-eval by openai

Evaluation harness for LLMs trained on code

Created 4 years ago

Updated 1 year ago

Starred by

Gregor Zunic (Cofounder of Browser Use), Alex Chen (Cofounder of Nexa AI), and 15 more.

ragas by vibrantlabsai

Toolkit for LLM application evaluation

Created 2 years ago

Updated 1 day ago

Starred by

Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 35 more.

evals by openai

Framework for evaluating LLMs and LLM systems, plus benchmark registry

Created 3 years ago

Updated 3 months ago
