alpaca_eval by tatsu-lab

Automatic evaluator for instruction-following language models

Created 2 years ago
1,855 stars

Top 23.4% on SourcePulse

View on GitHub
Project Summary

AlpacaEval provides an automatic evaluation framework for instruction-following language models, aiming to be fast, cheap, and highly correlated with human judgment. It's designed for researchers and developers needing to quickly assess and compare LLM performance on instruction-following tasks, offering a replicable alternative to manual evaluation.

How It Works

AlpacaEval uses a powerful LLM (like GPT-4) as an automatic annotator to compare outputs from different models against a reference model on a standardized instruction set. It calculates a "win rate" based on the annotator's preferences. AlpacaEval 2.0 introduces length-controlled win rates, which mitigate the bias towards longer outputs observed in earlier versions and improve correlation with human benchmarks like Chatbot Arena.
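
As a rough illustration of the core metric (a sketch, not the package's actual implementation), the basic win rate is simply the average of the annotator's per-instruction preferences, with ties counted as half a win:

    # Illustrative sketch only -- not the alpaca_eval package's code.
    # AlpacaEval reports the percentage of instructions on which the
    # annotator prefers the evaluated model over the reference; the
    # length-controlled variant in AlpacaEval 2.0 additionally adjusts
    # for output length.
    def win_rate(preferences):
        """preferences: one value per instruction --
        1.0 if the annotator preferred the evaluated model's output,
        0.0 if it preferred the reference output, 0.5 for a tie."""
        return 100.0 * sum(preferences) / len(preferences)

    # Example: 3 wins, 1 tie, 1 loss over 5 instructions -> 70.0
    print(win_rate([1.0, 1.0, 1.0, 0.5, 0.0]))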

Quick Start & Requirements

  • Install: pip install alpaca-eval
  • API Key: export OPENAI_API_KEY=<your_api_key> (required when using OpenAI models as annotators).
  • Usage: alpaca_eval --model_outputs 'example/outputs.json' (see the sketch after this list)
  • Documentation: AlpacaEval GitHub
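
A minimal end-to-end sketch, assuming the outputs file is a JSON list of records with "instruction", "output", and "generator" fields as in the example file shipped with the repo (treat the field names as assumptions and check the README if your version differs):

    # Minimal sketch: write model outputs in the JSON format expected by
    # the alpaca_eval CLI, then invoke the CLI shown in the Usage bullet.
    import json
    import subprocess

    records = [
        {
            "instruction": "Explain the difference between a list and a tuple in Python.",
            "output": "A list is mutable, whereas a tuple is immutable...",
            "generator": "my_model",  # name to report for this model
        },
        # ... one record per instruction in the evaluation set
    ]

    with open("my_outputs.json", "w") as f:
        json.dump(records, f, indent=2)

    # Requires OPENAI_API_KEY in the environment for OpenAI-based annotators.
    subprocess.run(["alpaca_eval", "--model_outputs", "my_outputs.json"], check=True)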

Highlighted Details

  • Achieves 0.98 Spearman correlation with Chatbot Arena using length-controlled win rates.
  • Evaluation costs less than $10 in OpenAI credits and runs in under 3 minutes.
  • Supports evaluating models directly from HuggingFace or various APIs.
  • Offers a toolkit for creating, analyzing, and comparing custom evaluators.

Maintenance & Community

The project is actively maintained by tatsu-lab. Community contributions for models, evaluators, and datasets are welcome. A Discord server is available for support.

Licensing & Compatibility

The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Automatic evaluators may exhibit biases (e.g., favoring output style over factuality, or preferring outputs from models similar to the annotator). The evaluation set might not fully represent advanced LLM use cases. AlpacaEval does not assess model safety or ethical risks.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 18 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

Top 0.1% on SourcePulse · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 1 month ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 34 more.

evals by openai

Top 0.2% on SourcePulse · 17k stars
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 2 years ago · Updated 9 months ago