AlignBench by THUDM

Chinese LLM evaluation benchmark (ACL 2024)

created 1 year ago
401 stars

Top 73.3% on sourcepulse

Project Summary

AlignBench is a comprehensive benchmark designed to evaluate the alignment of Chinese Large Language Models (LLMs) with human intent across multiple dimensions. It addresses the limitations of existing benchmarks by providing a dynamic, human-curated dataset and a sophisticated LLM-as-Judge evaluation methodology for more reliable and interpretable assessments.

How It Works

AlignBench utilizes a structured dataset derived from real user queries, categorized into eight key areas including reasoning, writing, and professional knowledge. The evaluation employs GPT-4-0613 as a judge, leveraging Chain-of-Thought prompting for detailed, multi-dimensional analysis of model responses. Rule calibration, based on high-quality reference answers, ensures consistent scoring across different task types, enhancing the robustness of the evaluation.
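
In practice, the judging step boils down to one calibrated prompt per question: the judge model sees the user query, a high-quality reference answer, and the candidate answer, reasons step by step, and emits per-dimension scores. The minimal sketch below assumes the openai Python client and an illustrative prompt; the prompt wording, dimension names, and scoring scale are assumptions for illustration, not AlignBench's actual templates.

```python
# Minimal LLM-as-Judge sketch with Chain-of-Thought reasoning and a reference
# answer used as a calibration baseline. Prompt text, dimensions, and the 1-10
# scale are illustrative assumptions, not AlignBench's own templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a Chinese LLM's answer.
Question: {question}
Reference answer (calibration baseline): {reference}
Model answer: {answer}

First reason step by step about correctness, relevance, and clarity,
then give a 1-10 score for each dimension and an overall 1-10 score."""

def judge(question: str, reference: str, answer: str) -> str:
    """Ask the judge model for a multi-dimensional, CoT-style assessment."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content
```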

Quick Start & Requirements

To evaluate a model:

  1. Implement your model's API in inference/api_models/ (a sketch of such a wrapper follows this list).
  2. Generate model answers using python get_answers.py.
  3. Obtain a GPT-4 API key and configure it in config/multi-dimension.json.
  4. Run evaluations with python judge.py.
  5. Aggregate results using python show_result.py.
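
Step 1 only requires a thin wrapper that the inference scripts can call for each prompt. The sketch below shows one hypothetical shape for such a wrapper, assuming a generic HTTP chat endpoint; the class name, method, endpoint, and response field are assumptions to adapt to the existing files in inference/api_models/.

```python
# Hypothetical model wrapper for inference/api_models/ -- names and the
# response schema are assumptions; mirror the interface used by the
# existing wrappers in that directory.
import requests

class MyModelAPI:
    """Call a locally hosted chat endpoint and return the answer text."""

    def __init__(self, endpoint: str = "http://localhost:8000/v1/chat"):
        self.endpoint = endpoint  # assumed local inference server

    def generate(self, prompt: str) -> str:
        resp = requests.post(self.endpoint, json={"prompt": prompt}, timeout=120)
        resp.raise_for_status()
        return resp.json()["text"]  # field name is an assumption
```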

Highlighted Details

  • Comprehensive dataset of 683 real-world Chinese user queries across 8 categories.
  • LLM-as-Judge evaluation with Chain-of-Thought and rule calibration for reliability.
  • Detailed multi-dimensional analysis of model responses.
  • Includes a leaderboard of major LLMs evaluated on AlignBench v1.1.

Maintenance & Community

The project is associated with THUDM and has contributions from numerous researchers. The latest update (v1.1) includes manual corrections and source evidence for factual answers.

Licensing & Compatibility

No license is specified for the repository. The dataset and code are intended for research purposes.

Limitations & Caveats

The evaluation relies on GPT-4 as the judge, which may introduce its own biases. The dataset is primarily sourced from real user queries, and while efforts are made to ensure quality, it may not cover all possible edge cases or adversarial scenarios.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days
