autoevals by braintrustdata

Evaluation tool for AI model outputs using automatic methods

Created 2 years ago
625 stars

Top 52.9% on SourcePulse

Project Summary

AutoEvals is a Python and TypeScript library for evaluating AI model outputs using a variety of methods, including LLM-as-a-judge, statistical, and heuristic approaches. It aims to simplify the process of debugging, comparing, and managing AI evaluations, making it easier for developers and researchers to assess model performance across subjective tasks like fact-checking and safety.

How It Works

AutoEvals provides a unified interface across diverse evaluation metrics, normalizing every result to a 0-1 score. It handles error-prone plumbing such as parsing LLM-generated outputs, and makes individual evaluations easier to debug by allowing flexible prompt tweaking and direct output inspection. The library also supports custom evaluation prompts and user-defined scoring functions, enabling tailored assessment workflows.
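As a minimal sketch of that unified interface, using two built-in heuristic scorers (the top-level Levenshtein and ExactMatch imports follow the documented pattern, but treat the exact import path as an assumption):

    from autoevals import ExactMatch, Levenshtein

    # Heuristic scorers need no API key; each call returns a Score
    # object whose .score field is normalized to the 0-1 range.
    lev = Levenshtein()(output="linked in", expected="linkedin")
    exact = ExactMatch()(output="42", expected="42")

    print(lev.score)    # e.g. ~0.89 (edit distance normalized by length)
    print(exact.score)  # 1.0 on an exact match, 0.0 otherwise

LLM-as-a-judge scorers return the same Score shape, so heuristic and model-graded results can be logged and compared side by side.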

Quick Start & Requirements

  • Python: pip install autoevals (see the first-evaluation sketch after this list)
  • TypeScript: npm install autoevals
  • Requirements: Python 3.9+, OpenAI Python SDK v0.x/v1.x compatible. Requires OPENAI_API_KEY environment variable for default OpenAI usage.
  • Docs: https://www.braintrust.dev/docs/reference/autoevals
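A first evaluation, closely following the example in the project docs (assumes OPENAI_API_KEY is set; the question and answers are illustrative):

    from autoevals.llm import Factuality

    evaluator = Factuality()
    result = evaluator(
        output="People's Republic of China",
        expected="China",
        input="Which country has the highest population?",
    )
    print(result.score)                  # 0-1 factuality score
    print(result.metadata["rationale"])  # the judge's reasoning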

Highlighted Details

  • Supports LLM-as-a-judge evaluations for tasks like Factuality, Moderation, and SQL.
  • Includes heuristic (Levenshtein, Exact Match) and statistical (BLEU) methods.
  • Enables custom LLM classifiers with user-defined prompts and scoring logic (sketched after this list).
  • Integrates with Braintrust for logging and comparison of evaluation results.
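The custom-classifier bullet can be made concrete with a short sketch; the LLMClassifier name, prompt_template placeholder, and choice_scores mapping follow the documented pattern, but treat the exact parameters as assumptions:

    from autoevals import LLMClassifier

    # Map each classifier choice to a 0-1 score; use_cot asks the
    # judge to reason step by step before committing to a choice.
    tone = LLMClassifier(
        name="Tone",
        prompt_template="Is this response formal or casual?\n\n{{output}}",
        choice_scores={"Formal": 1, "Casual": 0},
        use_cot=True,
    )

    result = tone(output="Hey dude, what's up?")
    print(result.score)  # expected to be near 0 for a casual reply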

Maintenance & Community

Developed by the team at Braintrust. Contribution guidelines and development setup are available in the README.

Licensing & Compatibility

The library appears to be open-source, but the README provided does not state a specific license. Commercial use or closed-source linking would therefore require clarifying the license first.

Limitations & Caveats

The README does not explicitly state the license, which could be a blocker for commercial adoption. While it supports various AI providers via OpenAI-compatible APIs, specific provider configurations might require further investigation.
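For non-OpenAI providers, one plausible configuration path is routing an OpenAI-compatible client through the library's init hook; the init(client=...) call and the endpoint below are assumptions to verify against the docs for your installed version:

    from openai import OpenAI

    from autoevals import init
    from autoevals.llm import Factuality

    # Hypothetical OpenAI-compatible endpoint and key; substitute your
    # provider's real values.
    client = OpenAI(base_url="https://my-provider.example.com/v1", api_key="...")
    init(client=client)  # scorers created afterwards use this client

    result = Factuality()(
        output="Paris",
        expected="Paris",
        input="What is the capital of France?",
    )
    print(result.score)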

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 0
  • Star History: 43 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

LLM evaluation toolkit for multiple backends

Created 1 year ago, updated 1 day ago
Top 2.6% on SourcePulse
2k stars
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

argilla by argilla-io

Collaboration tool for building high-quality AI datasets

Created 4 years ago, updated 3 days ago
Top 0.2% on SourcePulse
5k stars