alpaca_eval by tatsu-lab

Automatic evaluator for instruction-following language models

created 2 years ago
1,816 stars

Top 24.3% on sourcepulse

View on GitHub
Project Summary

AlpacaEval provides an automatic evaluation framework for instruction-following language models, aiming to be fast, cheap, and highly correlated with human judgment. It's designed for researchers and developers needing to quickly assess and compare LLM performance on instruction-following tasks, offering a replicable alternative to manual evaluation.

How It Works

AlpacaEval uses a powerful LLM (like GPT-4) as an automatic annotator to compare outputs from different models against a reference model on a standardized instruction set. It calculates a "win rate" based on the annotator's preferences. AlpacaEval 2.0 introduces length-controlled win rates, which mitigate the bias towards longer outputs observed in earlier versions and improve correlation with human benchmarks like Chatbot Arena.
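The win rate itself is simply the fraction of head-to-head comparisons in which the annotator prefers the evaluated model's output over the reference's. Below is a minimal Python sketch of that calculation; the records and field names are hypothetical illustrations, not AlpacaEval's internal data format, and ties are counted as half a win (a common convention).

    # Minimal sketch: compute a plain (not length-controlled) win rate from
    # pairwise annotator preferences. Data and field names are hypothetical.
    annotations = [
        {"instruction": "q1", "preference": "model"},      # annotator preferred the evaluated model
        {"instruction": "q2", "preference": "reference"},  # annotator preferred the reference model
        {"instruction": "q3", "preference": "model"},
        {"instruction": "q4", "preference": "tie"},
    ]

    # Ties count as half a win.
    wins = sum(1.0 for a in annotations if a["preference"] == "model")
    ties = sum(0.5 for a in annotations if a["preference"] == "tie")
    win_rate = 100 * (wins + ties) / len(annotations)
    print(f"win rate: {win_rate:.1f}%")  # 62.5% on this toy data

The length-controlled win rate of AlpacaEval 2.0 additionally adjusts these preferences for the length difference between the two outputs before aggregating, which is what removes the long-output bias.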

Quick Start & Requirements

  • Install: pip install alpaca-eval
  • API Key: export OPENAI_API_KEY=<your_api_key> is required for using OpenAI models.
  • Usage: alpaca_eval --model_outputs 'example/outputs.json' (a runnable end-to-end sketch follows this list)
  • Documentation: AlpacaEval GitHub
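
Putting the steps together, the sketch below writes a model-outputs file and invokes the CLI from Python. The JSON field names ("instruction", "output", "generator") and the file name are assumptions based on the example files shipped with the repository; check the README for the authoritative schema.

    # Sketch: write a hypothetical model-outputs file and call the alpaca_eval CLI.
    # Assumes `pip install alpaca-eval` has been run and OPENAI_API_KEY is set.
    import json
    import subprocess

    # One record per instruction in the evaluation set (hypothetical example).
    model_outputs = [
        {
            "instruction": "Explain what a win rate is in one sentence.",
            "output": "A win rate is the fraction of comparisons a model wins.",
            "generator": "my-model",  # label used for this model in the results
        },
    ]

    with open("my_outputs.json", "w") as f:
        json.dump(model_outputs, f, indent=2)

    # Equivalent to: alpaca_eval --model_outputs 'my_outputs.json'
    subprocess.run(["alpaca_eval", "--model_outputs", "my_outputs.json"], check=True)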

Highlighted Details

  • Achieves 0.98 Spearman correlation with Chatbot Arena using length-controlled win rates.
  • Evaluation costs less than $10 in OpenAI credits and runs in under 3 minutes.
  • Supports evaluating models directly from HuggingFace or various APIs.
  • Offers a toolkit for creating, analyzing, and comparing custom evaluators.

Maintenance & Community

The project is actively maintained by tatsu-lab. Community contributions of models, evaluators, and datasets are welcome, and a Discord server is available for support.

Licensing & Compatibility

The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Automatic evaluators may exhibit biases (e.g., favoring output style over factuality, or preferring outputs from models similar to the annotator). The evaluation set might not fully represent advanced LLM use cases. AlpacaEval does not assess model safety or ethical risks.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 88 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Simon Willison (Author of Django), and 9 more.

simple-evals by openai

Lightweight library for evaluating language models

4k stars · Top 0.4%
created 1 year ago, updated 3 weeks ago