arena-hard-auto by lmarena

Automatic LLM benchmark for instruction-tuned models, correlating with human preference

created 1 year ago
888 stars

Top 41.5% on sourcepulse

Project Summary

Arena-Hard-Auto provides an automated LLM evaluation framework designed to mimic the challenging, open-ended prompts found in the Chatbot Arena. It targets LLM developers and researchers seeking to benchmark their models against human preferences using automated judges like GPT-4.1 and Gemini-2.5, offering a faster and cheaper alternative to human evaluation.

How It Works

The system leverages a curated set of "hard" prompts, similar to those used in Chatbot Arena, to stress-test LLM capabilities. It employs advanced LLMs (GPT-4.1, Gemini-2.5) as automated judges to assess model responses, aiming for high agreement with human judgments and strong separability between models. The framework also introduces novel metrics such as "Separability with Confidence" to better evaluate a benchmark's ability to distinguish between similarly performing models.
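To make the "Separability with Confidence" idea concrete, the sketch below bootstraps per-prompt judge verdicts into confidence intervals and counts how many model pairs remain distinguishable. It uses synthetic data and is an illustration of the concept, not the repository's implementation.

```python
# Illustrative only: "separability with confidence" as the fraction of model
# pairs whose bootstrapped score intervals do not overlap. Synthetic data,
# not the repository's implementation.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(wins, n_boot=1000, alpha=0.05):
    # 95% bootstrap CI of a model's mean win rate over the prompt set.
    n = len(wins)
    means = [rng.choice(wins, size=n, replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def separability(per_model_wins):
    # Fraction of model pairs whose confidence intervals are disjoint.
    cis = {m: bootstrap_ci(np.asarray(w)) for m, w in per_model_wins.items()}
    models = list(cis)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    disjoint = sum(cis[a][1] < cis[b][0] or cis[b][1] < cis[a][0] for a, b in pairs)
    return disjoint / len(pairs)

# Toy per-prompt verdicts: 1 = judged win against a fixed baseline, 0 = loss.
verdicts = {
    "model_a": rng.binomial(1, 0.70, 500),
    "model_b": rng.binomial(1, 0.55, 500),
    "model_c": rng.binomial(1, 0.52, 500),
}
print(f"separability: {separability(verdicts):.2f}")
```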

Quick Start & Requirements

  • Install: Clone the repository (git clone https://github.com/lmarena/arena-hard-auto.git), navigate into the directory, and install dependencies (pip install -r requirements.txt, pip install -r requirements-optional.txt).
  • Data: Download pre-generated model answers and judgments by using Git LFS and cloning the Hugging Face dataset (git clone git@hf.co:datasets/lmarena-ai/arena-hard-auto arena-hard-data); a sketch for inspecting the downloaded files follows this list.
  • Prerequisites: Python, Git LFS. API access to evaluation models (GPT-4.1, Gemini-2.5) is required for generating judgments.
  • Resources: Evaluation involves running LLMs for answer generation and judgment, which can be resource-intensive.
  • Docs: https://github.com/lmarena/arena-hard-auto
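The following minimal sketch takes a first look at the cloned data. It assumes the files are JSON Lines; the internal directory layout is not assumed, so check the arena-hard-data folder for the actual paths.

```python
# A peek at the downloaded evaluation data, assuming JSON Lines files; the
# arena-hard-data path comes from the clone command above, but the internal
# layout is not assumed here -- we just look for any *.jsonl file.
import json
from pathlib import Path

data_dir = Path("arena-hard-data")
for path in sorted(data_dir.rglob("*.jsonl"))[:3]:
    with path.open() as f:
        first_record = json.loads(f.readline())
    print(path, "->", sorted(first_record.keys()))  # inspect the record schema
```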

Highlighted Details

  • Arena-Hard-v2.0 includes 500 new hard prompts and 250 creative writing prompts.
  • Supports "Style Control" for evaluating responses while accounting for stylistic attributes such as token length and markdown elements (see the sketch after this list).
  • Introduces novel metrics: Separability with Confidence, Agreement with Confidence, and Pair Rank Brier Score.
  • Offers an optional benchmark viewer using Gradio.
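The style-control sketch referenced above fits a logistic model of pairwise judge verdicts on style covariates (normalized length and markdown differences) so that stylistic effects are separated from model strength. It uses synthetic data and scikit-learn, and is a conceptual illustration rather than the repository's code.

```python
# Conceptual sketch of style control with synthetic data (requires scikit-learn):
# regress pairwise judge verdicts on style covariates so that length/markdown
# effects are absorbed by the covariates, and the intercept reflects the
# style-controlled advantage of model B over model A. Not the repository's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
len_diff = rng.normal(0.3, 1.0, n)   # normalized token-length difference (B - A)
md_diff = rng.normal(0.2, 1.0, n)    # normalized markdown-element difference (B - A)

# Simulate a judge that rewards both true quality (+0.4 for B) and style.
logits = 0.4 + 0.8 * len_diff + 0.5 * md_diff
b_wins = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X = np.column_stack([len_diff, md_diff])
fit = LogisticRegression(C=1e6).fit(X, b_wins)
print("style-controlled advantage of B (log-odds):", round(fit.intercept_[0], 2))
print("length / markdown coefficients:", np.round(fit.coef_[0], 2))
```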

Maintenance & Community

The project is maintained under the LMArena umbrella and is actively developed, as indicated by recent updates and the forthcoming v2.0. Community contributions are welcomed via pull requests or issues, and contact information for adding models to the leaderboard is provided.

Licensing & Compatibility

The repository's code is likely under a permissive license, but the specific license is not explicitly stated in the README. The data and evaluation methodology are described in associated papers.

Limitations & Caveats

The use of LLMs as judges introduces potential biases and limitations inherent to those models. The README notes that leaderboard submission requires an OpenAI-compatible endpoint and API key for judgment inference, which may be inconvenient for some users.
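For judgment inference against an OpenAI-compatible endpoint, a hedged sketch using the official openai Python client is shown below; the JUDGE_BASE_URL variable and the prompt are placeholders, and the actual judge prompts and configuration live in the repository.

```python
# Hedged sketch of calling an OpenAI-compatible endpoint for judgment inference
# with the official openai client; JUDGE_BASE_URL and the prompt below are
# placeholders -- the actual judge prompts and configs live in the repository.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("JUDGE_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ["OPENAI_API_KEY"],
)
response = client.chat.completions.create(
    model="gpt-4.1",  # judge model named in the README; swap for your endpoint's model
    messages=[{"role": "user", "content": "Compare assistant A and assistant B ..."}],
)
print(response.choices[0].message.content)
```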

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 3
  • Star History: 98 stars in the last 90 days
