Automatic evaluator for instruction-following language models
AlpacaEval provides an automatic evaluation framework for instruction-following language models that aims to be fast, cheap, and highly correlated with human judgment. It is designed for researchers and developers who need to quickly assess and compare LLM performance on instruction-following tasks, offering a reproducible alternative to manual evaluation.
How It Works
AlpacaEval uses a powerful LLM (like GPT-4) as an automatic annotator to compare outputs from different models against a reference model on a standardized instruction set. It calculates a "win rate" based on the annotator's preferences. AlpacaEval 2.0 introduces length-controlled win rates, which mitigate the bias towards longer outputs observed in earlier versions and improve correlation with human benchmarks like Chatbot Arena.
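To make the win-rate idea concrete, here is a minimal sketch in Python (an illustration only, not AlpacaEval's actual implementation; the preference encoding and the omission of the length correction are simplifying assumptions):

# Sketch: turn per-example annotator preferences into a win rate.
# Assumed encoding: 1.0 = candidate model's output preferred over the reference,
# 0.0 = reference preferred, 0.5 = tie. AlpacaEval 2.0's length-controlled
# variant additionally regresses out a length term, which is not shown here.
def win_rate(preferences: list[float]) -> float:
    """Average preference as a percentage; 50% means parity with the reference."""
    if not preferences:
        raise ValueError("no annotated examples")
    return 100.0 * sum(preferences) / len(preferences)

# Example: 3 wins, 1 loss, 1 tie across 5 instructions -> 70.0
print(win_rate([1.0, 1.0, 1.0, 0.0, 0.5]))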
Quick Start & Requirements
Install the package and set your OpenAI API key (required for using OpenAI models):

pip install alpaca-eval
export OPENAI_API_KEY=<your_api_key>

Then evaluate a set of model outputs:

alpaca_eval --model_outputs 'example/outputs.json'
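The model_outputs file is a JSON list with one record per instruction. A minimal illustration of its shape is shown here as a Python literal; the field names ("instruction", "output", "generator") follow the example files, but treat them as assumptions and check the repository's example/outputs.json for the authoritative format.

# Hypothetical single entry of a model_outputs JSON file (illustrative values).
example_entry = {
    "instruction": "Explain the difference between a list and a tuple in Python.",  # prompt from the eval set
    "output": "Lists are mutable, while tuples are immutable ...",                   # the model's response
    "generator": "my_model",                                                         # model name shown on the leaderboard
}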
Maintenance & Community
The project is actively maintained by the tatsu-lab. Community contributions for models, evaluators, and datasets are welcomed. A Discord server is available for support.
Licensing & Compatibility
The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Automatic evaluators may exhibit biases (e.g., favoring output style over factuality, or preferring outputs from models similar to the annotator). The evaluation set might not fully represent advanced LLM use cases. AlpacaEval does not assess model safety or ethical risks.