This repository provides a comprehensive benchmark and leaderboard for Chinese Large Language Models (LLMs), aiming to offer an objective and fair evaluation system. It targets LLM developers, researchers, and users who want to understand and compare model capabilities across domains, supporting informed model selection and development.
How It Works
The project evaluates LLMs across 8 major domains (e.g., medical, education, finance, legal, reasoning, language) comprising roughly 300 sub-dimensions. Each question in a dataset is rated 1-5 for response quality, and the ratings are normalized to a 100-point scale per dataset. The overall score is the average of the per-domain scores, giving a multi-faceted view of model performance.
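The README does not publish the exact normalization or aggregation formulas, so the sketch below illustrates one plausible reading: per-question 1-5 ratings are averaged, rescaled to a 100-point dataset score, and domains are averaged with equal weight. The function names, the rescaling rule, and the equal weighting are all assumptions for illustration, not the project's documented method.

```python
from statistics import mean

def dataset_score(ratings: list[int]) -> float:
    """Normalize per-question 1-5 ratings to a 100-point dataset score.

    Assumes a simple linear rescaling (mean rating / 5 * 100); the
    project's actual normalization is not documented.
    """
    assert all(1 <= r <= 5 for r in ratings), "ratings must be in 1..5"
    return mean(ratings) / 5 * 100

def overall_score(domain_scores: dict[str, float]) -> float:
    """Unweighted average across domains (equal weighting is assumed)."""
    return mean(domain_scores.values())

# Hypothetical example: one model rated on two small datasets.
medical = dataset_score([5, 4, 4, 3, 5])   # -> 84.0
legal = dataset_score([3, 4, 2, 4, 3])     # -> 64.0
print(overall_score({"medical": medical, "legal": legal}))  # -> 74.0
```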
Quick Start & Requirements
- Access: The project primarily serves as a data and leaderboard repository. No direct installation or execution commands are provided for running benchmarks.
- Data: The evaluation datasets and model outputs are available via links in the README (e.g., the `alldata` and `eval` file directories).
- Resources: Evaluating LLMs at this scale requires significant computational resources; specific hardware requirements are not documented in the README.
Highlighted Details
- Benchmarks over 232 commercial and open-source LLMs, including detailed breakdowns by model size and pricing tiers.
- Extensive domain coverage, featuring specialized rankings for medical professions (physician, nursing, pharmacy, etc.), educational levels (primary, secondary, college entrance exams), finance, law, and more.
- An "error database" of over 2 million entries cataloging LLM mistakes, supporting failure analysis and model improvement.
- Regular updates with new models and expanded evaluation dimensions, indicating active maintenance.
Maintenance & Community
- The project is actively maintained, with frequent updates listed (e.g., v3.20 on 2025/4/30).
- A WeChat group is available for community discussion and evaluation exchange.
Licensing & Compatibility
- The repository itself appears to be open-source, but the specific license is not explicitly stated in the provided text.
- Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
- The README does not specify the exact methodology for generating the "error database" or the specific datasets used for each sub-dimension.
- While extensive, the evaluation focuses on specific task types (e.g., multiple-choice, fill-in-the-blank) and may not cover capabilities such as creative writing or multi-turn dialogue.