alpaca_eval by tatsu-lab

Automatic evaluator for instruction-following language models

Created 2 years ago
1,855 stars

Top 23.4% on SourcePulse

View on GitHub
Project Summary

AlpacaEval provides an automatic evaluation framework for instruction-following language models, aiming to be fast, cheap, and highly correlated with human judgment. It's designed for researchers and developers needing to quickly assess and compare LLM performance on instruction-following tasks, offering a replicable alternative to manual evaluation.

How It Works

AlpacaEval uses a powerful LLM (like GPT-4) as an automatic annotator to compare outputs from different models against a reference model on a standardized instruction set. It calculates a "win rate" based on the annotator's preferences. AlpacaEval 2.0 introduces length-controlled win rates, which mitigate the bias towards longer outputs observed in earlier versions and improve correlation with human benchmarks like Chatbot Arena.
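
As a rough illustration of the core metric (a sketch, not the package's actual implementation), the basic win rate is simply the average of the annotator's per-instruction preferences, with ties counted as half a win:

    # Illustrative sketch only -- not the alpaca_eval package's code.
    # AlpacaEval reports the percentage of instructions on which the
    # annotator prefers the evaluated model over the reference; the
    # length-controlled variant in AlpacaEval 2.0 additionally adjusts
    # for output length.
    def win_rate(preferences):
        """preferences: one value per instruction --
        1.0 if the annotator preferred the evaluated model's output,
        0.0 if it preferred the reference output, 0.5 for a tie."""
        return 100.0 * sum(preferences) / len(preferences)

    # Example: 3 wins, 1 tie, 1 loss over 5 instructions -> 70.0
    print(win_rate([1.0, 1.0, 1.0, 0.5, 0.0]))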

Quick Start & Requirements

  • Install: pip install alpaca-eval
  • API Key: export OPENAI_API_KEY=<your_api_key> (required when using OpenAI models as annotators).
  • Usage: alpaca_eval --model_outputs 'example/outputs.json' (see the sketch after this list)
  • Documentation: AlpacaEval GitHub
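
A minimal end-to-end sketch, assuming the outputs file is a JSON list of records with "instruction", "output", and "generator" fields as in the example file shipped with the repo (treat the field names as assumptions and check the README if your version differs):

    # Minimal sketch: write model outputs in the JSON format expected by
    # the alpaca_eval CLI, then invoke the CLI shown in the Usage bullet.
    import json
    import subprocess

    records = [
        {
            "instruction": "Explain the difference between a list and a tuple in Python.",
            "output": "A list is mutable, whereas a tuple is immutable...",
            "generator": "my_model",  # name to report for this model
        },
        # ... one record per instruction in the evaluation set
    ]

    with open("my_outputs.json", "w") as f:
        json.dump(records, f, indent=2)

    # Requires OPENAI_API_KEY in the environment for OpenAI-based annotators.
    subprocess.run(["alpaca_eval", "--model_outputs", "my_outputs.json"], check=True)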

Highlighted Details

  • Achieves 0.98 Spearman correlation with Chatbot Arena using length-controlled win rates.
  • Evaluation costs less than $10 in OpenAI credits and runs in under 3 minutes.
  • Supports evaluating models directly from HuggingFace or various APIs.
  • Offers a toolkit for creating, analyzing, and comparing custom evaluators.

Maintenance & Community

The project is actively maintained by tatsu-lab. Community contributions for models, evaluators, and datasets are welcome. A Discord server is available for support.

Licensing & Compatibility

The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Automatic evaluators may exhibit biases (e.g., favoring output style over factuality, or preferring outputs from models similar to the annotator). The evaluation set might not fully represent advanced LLM use cases. AlpacaEval does not assess model safety or ethical risks.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 18 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft

Top 0.1% on SourcePulse · 3k stars
LLM evaluation framework
Created 2 years ago · Updated 1 month ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 34 more.

evals by openai

Top 0.2% on SourcePulse · 17k stars
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Created 2 years ago · Updated 9 months ago