Chinese LLM evaluation benchmark (ACL 2024)
AlignBench is a comprehensive benchmark designed to evaluate the alignment of Chinese Large Language Models (LLMs) with human intent across multiple dimensions. It addresses the limitations of existing benchmarks by providing a dynamic, human-curated dataset and a sophisticated LLM-as-Judge evaluation methodology for more reliable and interpretable assessments.
How It Works
AlignBench utilizes a structured dataset derived from real user queries, categorized into eight key areas including reasoning, writing, and professional knowledge. The evaluation employs GPT-4-0613 as a judge, leveraging Chain-of-Thought prompting for detailed, multi-dimensional analysis of model responses. Rule calibration, based on high-quality reference answers, ensures consistent scoring across different task types, enhancing the robustness of the evaluation.
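To make the LLM-as-Judge step concrete, below is a minimal Python sketch of a reference-calibrated, chain-of-thought judging call. It assumes the official OpenAI Python client and uses simplified prompt wording; the repository's actual prompt templates and per-category scoring dimensions are defined in config/multi-dimension.json, not here.

```python
# Illustrative sketch of a rule-calibrated LLM-as-Judge call (not the repo's exact prompts).
# Assumes the OpenAI Python client; prompt wording and dimensions are simplified.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(question: str, reference: str, answer: str) -> str:
    """Ask GPT-4-0613 to grade a model answer against a reference, with chain-of-thought."""
    prompt = (
        "You are grading a Chinese LLM's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "First analyse the answer step by step along dimensions such as correctness, "
        "relevance, and clarity, then give a final overall score from 1 to 10, "
        "calibrated against the reference answer."
    )
    completion = client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```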
Quick Start & Requirements
To evaluate a model:
1. Add the model's API inference code under inference/api_models/ (a hypothetical wrapper sketch follows this list).
2. Run python get_answers.py to generate the model's answers to the benchmark questions.
3. Adjust the judge configuration in config/multi-dimension.json if needed.
4. Run python judge.py to have the GPT-4 judge score the answers.
5. Run python show_result.py to aggregate and display the results.
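As a rough illustration of step 1, a wrapper under inference/api_models/ typically exposes a function that takes a question and returns the model's answer. The function name, signature, and endpoint below are hypothetical; the actual interface should be copied from the example files already in that directory.

```python
# inference/api_models/my_model.py -- hypothetical wrapper, for illustration only.
# The real required function name and signature are defined by the existing
# examples in inference/api_models/; copy one of those as a starting point.
import requests

API_URL = "http://localhost:8000/generate"  # assumed local inference endpoint

def get_answer(question: str, temperature: float = 0.7, max_tokens: int = 1024) -> str:
    """Send one benchmark question to the model endpoint and return its answer text."""
    resp = requests.post(
        API_URL,
        json={"prompt": question, "temperature": temperature, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["text"]
```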
Highlighted Details
Maintenance & Community
The project is associated with THUDM and has contributions from numerous researchers. The latest update (v1.1) includes manual corrections and source evidence for factual answers.
Licensing & Compatibility
The repository is released under an unspecified license. The dataset and code are intended for research purposes.
Limitations & Caveats
The evaluation relies on GPT-4 as the judge, which may introduce its own biases. The dataset is primarily sourced from real user queries, and while efforts are made to ensure quality, it may not cover all possible edge cases or adversarial scenarios.