Automatic LLM benchmark for instruction-tuned models, correlating with human preference
Top 41.5% on sourcepulse
Arena-Hard-Auto provides an automated LLM evaluation framework designed to mimic the challenging, open-ended prompts found in the Chatbot Arena. It targets LLM developers and researchers seeking to benchmark their models against human preferences using automated judges like GPT-4.1 and Gemini-2.5, offering a faster and cheaper alternative to human evaluation.
How It Works
The system leverages a curated set of "hard" prompts, similar to those used in Chatbot Arena, to stress-test LLM capabilities. It employs advanced LLMs (GPT-4.1, Gemini-2.5) as automated judges to assess model responses, aiming for high agreement with human judgments and strong separability between models. The framework also introduces novel metrics like "Separability with Confidence" to better evaluate a benchmark's ability to distinguish between similarly performing models.
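The "Separability with Confidence" idea can be sketched as follows: bootstrap each model's win rate to obtain a confidence interval, then count the fraction of model pairs whose intervals do not overlap. This is an illustrative reimplementation under those assumptions, not the repository's actual code; the data and function names below are hypothetical.

```python
import random

def bootstrap_ci(wins, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate,
    given a list of 0/1 battle outcomes (illustrative only)."""
    rng = random.Random(seed)
    n = len(wins)
    means = sorted(
        sum(rng.choice(wins) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi

def separability(model_outcomes):
    """Fraction of model pairs whose win-rate CIs do not overlap."""
    cis = {m: bootstrap_ci(w) for m, w in model_outcomes.items()}
    models = list(cis)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    separated = sum(
        1 for a, b in pairs
        if cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0]
    )
    return separated / len(pairs)

# Toy outcomes (1 = win against a fixed baseline) for three models.
outcomes = {
    "model_a": [1] * 90 + [0] * 10,  # ~90% win rate
    "model_b": [1] * 50 + [0] * 50,  # ~50% win rate
    "model_c": [1] * 48 + [0] * 52,  # ~48% win rate, likely overlaps model_b
}
print(separability(outcomes))
```

A benchmark that cleanly ranks model_a above the other two but cannot distinguish model_b from model_c would score 2/3 here; higher values mean the benchmark more confidently tells similar models apart.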
Quick Start & Requirements
Clone the repository (`git clone https://github.com/lmarena/arena-hard-auto.git`), navigate into the directory, and install dependencies (`pip install -r requirements.txt`, optionally `pip install -r requirements-optional.txt`). Fetching the evaluation data requires `git lfs` and cloning the Hugging Face dataset (`git clone git@hf.co:datasets/lmarena-ai/arena-hard-auto arena-hard-data`).
Highlighted Details
Maintenance & Community
The project is associated with LMArena and has active development, indicated by recent updates and a forthcoming v2.0. Community contributions are welcomed via pull requests or issues. Contact information for adding models to the leaderboard is provided.
Licensing & Compatibility
The repository's code is likely under a permissive license, but the specific license is not explicitly stated in the README. The data and evaluation methodology are described in associated papers.
Limitations & Caveats
The use of LLMs as judges introduces potential biases and limitations inherent to those models. The README notes that leaderboard submission requires an OpenAI-compatible endpoint and API key for judgment inference, which adds setup overhead for users without such access.
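Because judgment inference goes through an OpenAI-compatible chat-completions API, the request a user's endpoint must serve looks roughly like the payload below. This is a minimal sketch: the judge prompt, verdict labels, and model name are placeholders, not the repository's actual configuration.

```python
import json

def build_judge_request(question, answer_a, answer_b,
                        judge_model="gpt-4.1"):  # placeholder judge name
    """Assemble a chat-completions payload asking a judge model to
    compare two answers to the same prompt (illustrative only)."""
    system = (
        "You are an impartial judge. Compare assistant A and assistant B "
        "and end with one verdict: [[A>>B]], [[A>B]], [[A=B]], "
        "[[B>A]], or [[B>>A]]."
    )
    user = (
        f"<|Question|>\n{question}\n\n"
        f"<|Answer A|>\n{answer_a}\n\n"
        f"<|Answer B|>\n{answer_b}"
    )
    return {
        "model": judge_model,
        "temperature": 0.0,  # deterministic judging
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

payload = build_judge_request("What is 2+2?", "4", "5")
print(json.dumps(payload, indent=2))
```

Any server that accepts this chat-completions shape (a self-hosted proxy, for instance) could stand in for the OpenAI endpoint, which is why only API compatibility, not a specific provider, is required.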