autoevals by braintrustdata

Evaluation tool for AI model outputs using automatic methods

created 2 years ago
575 stars

Top 56.9% on sourcepulse

Project Summary

AutoEvals is a Python and TypeScript library for evaluating AI model outputs with a mix of LLM-as-a-judge, statistical, and heuristic methods. It aims to simplify debugging, comparing, and managing AI evaluations, helping developers and researchers assess model performance on subjective tasks such as fact-checking and safety.

How It Works

AutoEvals provides a unified interface across diverse evaluation metrics, normalizing every result to a 0-1 scale. It handles tedious details such as parsing LLM-generated outputs, and makes individual evaluations easy to debug through flexible prompt tweaking and direct inspection of raw outputs. The library also supports custom evaluation prompts and user-defined scoring functions, enabling tailored assessment workflows.
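To make the 0-1 normalization concrete, here is a minimal stdlib-only sketch, not AutoEvals' actual implementation, of a heuristic scorer in the spirit of the library's Levenshtein evaluator: edit distance is mapped onto the 0-1 scale, with identical strings scoring 1.0.

```python
# Illustrative sketch (not AutoEvals source): a heuristic scorer that
# normalizes edit distance into the 0-1 range the library uses.
from dataclasses import dataclass


@dataclass
class Score:
    name: str
    score: float  # always normalized to the 0-1 range


def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_scorer(output: str, expected: str) -> Score:
    # Map distance into 0-1: identical strings score 1.0.
    max_len = max(len(output), len(expected)) or 1
    return Score(name="Levenshtein",
                 score=1 - levenshtein_distance(output, expected) / max_len)
```

Every scorer returning the same `Score` shape is what lets heterogeneous metrics (LLM judges, BLEU, exact match) be compared and logged uniformly.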

Quick Start & Requirements

  • Python: pip install autoevals
  • TypeScript: npm install autoevals
  • Requirements: Python 3.9+; compatible with both v0.x and v1.x of the OpenAI Python SDK. The default OpenAI-backed evaluators require the OPENAI_API_KEY environment variable.
  • Docs: https://www.braintrust.dev/docs/reference/autoevals
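A minimal usage example, adapted from the project's README; it assumes `autoevals` is installed and OPENAI_API_KEY is set, since Factuality calls out to an LLM judge:

```python
from autoevals.llm import Factuality

# Factuality is an LLM-as-a-judge evaluator; by default it queries
# OpenAI, so OPENAI_API_KEY must be set in the environment.
evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)  # normalized to the 0-1 range
```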

Highlighted Details

  • Supports LLM-as-a-judge evaluations for tasks like Factuality, Moderation, and SQL.
  • Includes heuristic (Levenshtein, Exact Match) and statistical (BLEU) methods.
  • Enables custom LLM classifiers with user-defined prompts and scoring logic.
  • Integrates with Braintrust for logging and comparison of evaluation results.
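The custom-classifier bullet follows a choice-scores pattern: a judge model picks one of several labeled choices, and a mapping converts that choice into a 0-1 score. The stdlib-only sketch below illustrates the pattern with a hypothetical `fake_judge` standing in for the model call; it is not the library's actual classifier API.

```python
# Hypothetical sketch of the choice-scores pattern used by custom
# LLM classifiers: the judge picks a letter, the mapping scores it.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

PROMPT = """Compare the submitted answer to the expert answer.
A: fully correct  B: partially correct  C: incorrect
Answer with a single letter."""


def fake_judge(prompt: str, output: str, expected: str) -> str:
    # Stand-in for an LLM call; a real classifier would query a model here.
    return "A" if output.strip().lower() == expected.strip().lower() else "C"


def classify(output: str, expected: str) -> float:
    # Run the judge, then map its choice onto the 0-1 scale.
    choice = fake_judge(PROMPT, output, expected)
    return CHOICE_SCORES[choice]
```

Keeping the scoring logic in a plain mapping makes the rubric easy to audit and tweak independently of the prompt.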

Maintenance & Community

Developed by the team at Braintrust. Contribution guidelines and development setup are available in the README.

Licensing & Compatibility

The README does not state an explicit license. Until the license is clarified, suitability for commercial use or closed-source linking cannot be confirmed.

Limitations & Caveats

The missing license declaration in the README could be a blocker for commercial adoption. The library targets OpenAI-compatible APIs, so using other AI providers may require additional configuration that the README does not document.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 109 stars in the last 90 days
