This repository provides a comprehensive benchmark and leaderboard for Chinese Large Language Models (LLMs), aiming to offer an objective and fair evaluation system. It targets LLM developers, researchers, and users who want to understand and compare model capabilities across domains, supporting informed model selection and development.
How It Works
The project evaluates LLMs across 8 major domains (e.g., medical, education, finance, legal, reasoning, language) comprising roughly 300 sub-dimensions. Each question in a dataset is rated 1-5 for response quality, and the ratings are normalized to a 100-point scale per dataset. The overall score is the average of the per-domain scores, giving a multi-faceted view of model performance.
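The README does not publish the exact normalization or aggregation formulas, so the sketch below illustrates one plausible reading: per-question 1-5 ratings are averaged, rescaled to a 100-point dataset score, and domains are averaged with equal weight. The function names, the rescaling rule, and the equal weighting are all assumptions for illustration, not the project's documented method.

```python
from statistics import mean

def dataset_score(ratings: list[int]) -> float:
    """Normalize per-question 1-5 ratings to a 100-point dataset score.

    Assumes a simple linear rescaling (mean rating / 5 * 100); the
    project's actual normalization is not documented.
    """
    assert all(1 <= r <= 5 for r in ratings), "ratings must be in 1..5"
    return mean(ratings) / 5 * 100

def overall_score(domain_scores: dict[str, float]) -> float:
    """Unweighted average across domains (equal weighting is assumed)."""
    return mean(domain_scores.values())

# Hypothetical example: one model rated on two small datasets.
medical = dataset_score([5, 4, 4, 3, 5])   # -> 84.0
legal = dataset_score([3, 4, 2, 4, 3])     # -> 64.0
print(overall_score({"medical": medical, "legal": legal}))  # -> 74.0
```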
Quick Start & Requirements
- Access: The project primarily serves as a data and leaderboard repository. No direct installation or execution commands are provided for running benchmarks.
- Data: The evaluation datasets and model outputs are available via links in the README (e.g., the `alldata` and `eval` file directories).
- Resources: Evaluating LLMs at this scale requires significant computational resources; specific hardware requirements are not documented in the README.
Highlighted Details
- Benchmarks over 232 commercial and open-source LLMs, including detailed breakdowns by model size and pricing tiers.
- Extensive domain coverage, featuring specialized rankings for medical professions (physician, nursing, pharmacy, etc.), educational levels (primary, secondary, college entrance exams), finance, law, and more.
- An "error database" of over 2 million entries cataloging LLM mistakes, supporting failure analysis and model improvement.
- Regular updates with new models and expanded evaluation dimensions, indicating active maintenance.
Maintenance & Community
- The project is actively maintained, with frequent updates listed (e.g., v3.20 on 2025/4/30).
- A WeChat group is available for community discussion and evaluation exchange.
Licensing & Compatibility
- The repository itself appears to be open-source, but the specific license is not explicitly stated in the provided text.
- Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
- The README does not specify the exact methodology for generating the "error database" or the specific datasets used for each sub-dimension.
- While extensive, the evaluation focuses on specific task types (e.g., multiple-choice, fill-in-the-blank) and may not cover capabilities such as creative writing or multi-turn dialogue.