AlignBench by THUDM

Chinese LLM evaluation benchmark (ACL 2024)

created 1 year ago
401 stars

Top 73.3% on sourcepulse

Project Summary

AlignBench is a comprehensive benchmark designed to evaluate the alignment of Chinese Large Language Models (LLMs) with human intent across multiple dimensions. It addresses the limitations of existing benchmarks by providing a dynamic, human-curated dataset and a sophisticated LLM-as-Judge evaluation methodology for more reliable and interpretable assessments.

How It Works

AlignBench utilizes a structured dataset derived from real user queries, categorized into eight key areas including reasoning, writing, and professional knowledge. The evaluation employs GPT-4-0613 as a judge, leveraging Chain-of-Thought prompting for detailed, multi-dimensional analysis of model responses. Rule calibration, based on high-quality reference answers, ensures consistent scoring across different task types, enhancing the robustness of the evaluation.
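
In practice, the judging step boils down to one calibrated prompt per question: the judge model sees the user query, a high-quality reference answer, and the candidate answer, reasons step by step, and emits per-dimension scores. The minimal sketch below assumes the openai Python client and an illustrative prompt; the prompt wording, dimension names, and scoring scale are assumptions for illustration, not AlignBench's actual templates.

```python
# Minimal LLM-as-Judge sketch with Chain-of-Thought reasoning and a reference
# answer used as a calibration baseline. Prompt text, dimensions, and the 1-10
# scale are illustrative assumptions, not AlignBench's own templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a Chinese LLM's answer.
Question: {question}
Reference answer (calibration baseline): {reference}
Model answer: {answer}

First reason step by step about correctness, relevance, and clarity,
then give a 1-10 score for each dimension and an overall 1-10 score."""

def judge(question: str, reference: str, answer: str) -> str:
    """Ask the judge model for a multi-dimensional, CoT-style assessment."""
    response = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content
```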

Quick Start & Requirements

To evaluate a model:

  1. Implement your model's API in inference/api_models/ (a sketch of such a wrapper follows this list).
  2. Generate model answers using python get_answers.py.
  3. Obtain a GPT-4 API key and configure it in config/multi-dimension.json.
  4. Run evaluations with python judge.py.
  5. Aggregate results using python show_result.py.
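
Step 1 only requires a thin wrapper that the inference scripts can call for each prompt. The sketch below shows one hypothetical shape for such a wrapper, assuming a generic HTTP chat endpoint; the class name, method, endpoint, and response field are assumptions to adapt to the existing files in inference/api_models/.

```python
# Hypothetical model wrapper for inference/api_models/ -- names and the
# response schema are assumptions; mirror the interface used by the
# existing wrappers in that directory.
import requests

class MyModelAPI:
    """Call a locally hosted chat endpoint and return the answer text."""

    def __init__(self, endpoint: str = "http://localhost:8000/v1/chat"):
        self.endpoint = endpoint  # assumed local inference server

    def generate(self, prompt: str) -> str:
        resp = requests.post(self.endpoint, json={"prompt": prompt}, timeout=120)
        resp.raise_for_status()
        return resp.json()["text"]  # field name is an assumption
```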

Highlighted Details

  • Comprehensive dataset of 683 real-world Chinese user queries across 8 categories.
  • LLM-as-Judge evaluation with Chain-of-Thought and rule calibration for reliability.
  • Detailed multi-dimensional analysis of model responses.
  • Includes a leaderboard of major LLMs evaluated on AlignBench v1.1.

Maintenance & Community

The project is associated with THUDM and has contributions from numerous researchers. The latest update (v1.1) includes manual corrections and source evidence for factual answers.

Licensing & Compatibility

No license is specified for the repository. The dataset and code are intended for research purposes.

Limitations & Caveats

The evaluation relies on GPT-4 as the judge, which may introduce its own biases. The dataset is primarily sourced from real user queries, and while efforts are made to ensure quality, it may not cover all possible edge cases or adversarial scenarios.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days
