Automatic LLM benchmark for instruction-tuned models, correlating with human preference
Top 41.5% on sourcepulse
Arena-Hard-Auto provides an automated LLM evaluation framework designed to mimic the challenging, open-ended prompts found in the Chatbot Arena. It targets LLM developers and researchers seeking to benchmark their models against human preferences using automated judges like GPT-4.1 and Gemini-2.5, offering a faster and cheaper alternative to human evaluation.
How It Works
The system leverages a curated set of "hard" prompts, similar to those used in Chatbot Arena, to stress-test LLM capabilities. It employs advanced LLMs (GPT-4.1, Gemini-2.5) as automated judges to assess model responses, aiming for high agreement with human judgments and strong separability between models. The framework also introduces novel metrics like "Separability with Confidence" to better evaluate a benchmark's ability to distinguish between similarly performing models.
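The "Separability with Confidence" idea can be sketched as follows: bootstrap each model's win rate to obtain a confidence interval, then count the fraction of model pairs whose intervals do not overlap. This is an illustrative reimplementation under those assumptions, not the repository's actual code; the data and function names below are hypothetical.

```python
import random

def bootstrap_ci(wins, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate,
    given a list of 0/1 battle outcomes (illustrative only)."""
    rng = random.Random(seed)
    n = len(wins)
    means = sorted(
        sum(rng.choice(wins) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi

def separability(model_outcomes):
    """Fraction of model pairs whose win-rate CIs do not overlap."""
    cis = {m: bootstrap_ci(w) for m, w in model_outcomes.items()}
    models = list(cis)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    separated = sum(
        1 for a, b in pairs
        if cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0]
    )
    return separated / len(pairs)

# Toy outcomes (1 = win against a fixed baseline) for three models.
outcomes = {
    "model_a": [1] * 90 + [0] * 10,  # ~90% win rate
    "model_b": [1] * 50 + [0] * 50,  # ~50% win rate
    "model_c": [1] * 48 + [0] * 52,  # ~48% win rate, likely overlaps model_b
}
print(separability(outcomes))
```

A benchmark that cleanly ranks model_a above the other two but cannot distinguish model_b from model_c would score 2/3 here; higher values mean the benchmark more confidently tells similar models apart.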
Quick Start & Requirements
Clone the repository (`git clone https://github.com/lmarena/arena-hard-auto.git`), navigate into the directory, and install dependencies (`pip install -r requirements.txt`, optionally `pip install -r requirements-optional.txt`). Fetching the evaluation data requires `git lfs` and cloning the Hugging Face dataset (`git clone git@hf.co:datasets/lmarena-ai/arena-hard-auto arena-hard-data`).
Highlighted Details
Maintenance & Community
The project is associated with LMArena and has active development, indicated by recent updates and a forthcoming v2.0. Community contributions are welcomed via pull requests or issues. Contact information for adding models to the leaderboard is provided.
Licensing & Compatibility
The repository's code is likely under a permissive license, but the specific license is not explicitly stated in the README. The data and evaluation methodology are described in associated papers.
Limitations & Caveats
The use of LLMs as judges introduces potential biases and limitations inherent to those models. The README notes that leaderboard submission requires an OpenAI-compatible endpoint and API key for judgment inference, which adds setup overhead for users without such access.
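Because judgment inference goes through an OpenAI-compatible chat-completions API, the request a user's endpoint must serve looks roughly like the payload below. This is a minimal sketch: the judge prompt, verdict labels, and model name are placeholders, not the repository's actual configuration.

```python
import json

def build_judge_request(question, answer_a, answer_b,
                        judge_model="gpt-4.1"):  # placeholder judge name
    """Assemble a chat-completions payload asking a judge model to
    compare two answers to the same prompt (illustrative only)."""
    system = (
        "You are an impartial judge. Compare assistant A and assistant B "
        "and end with one verdict: [[A>>B]], [[A>B]], [[A=B]], "
        "[[B>A]], or [[B>>A]]."
    )
    user = (
        f"<|Question|>\n{question}\n\n"
        f"<|Answer A|>\n{answer_a}\n\n"
        f"<|Answer B|>\n{answer_b}"
    )
    return {
        "model": judge_model,
        "temperature": 0.0,  # deterministic judging
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

payload = build_judge_request("What is 2+2?", "4", "5")
print(json.dumps(payload, indent=2))
```

Any server that accepts this chat-completions shape (a self-hosted proxy, for instance) could stand in for the OpenAI endpoint, which is why only API compatibility, not a specific provider, is required.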