chinese-llm-benchmark by jeinlee1991

Chinese LLM benchmark for evaluating capabilities across models

created 2 years ago
4,593 stars

Top 10.9% on sourcepulse

Project Summary

This repository provides a comprehensive benchmark and leaderboard for Chinese Large Language Models (LLMs), aiming to offer an objective and fair evaluation system. It targets LLM developers, researchers, and users who want to understand and compare model capabilities across a wide range of domains, supporting informed model selection and development decisions.

How It Works

The project evaluates LLMs across 8 major domains (e.g., medical, education, finance, legal, reasoning, language) with approximately 300 sub-dimensions. It uses a scoring system in which each question in a dataset is rated 1-5 based on response quality, with scores normalized to a 100-point scale per dataset. The overall score is an average of the per-domain scores, providing a multi-faceted view of model performance.
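
The sketch below illustrates this aggregation scheme; it is not the repository's actual code, and the exact normalization formula (here a linear map of 1-5 onto 0-100) and the example ratings are assumptions for illustration only.

```python
from statistics import mean

def dataset_score(ratings: list[int]) -> float:
    """Normalize per-question 1-5 ratings to a 0-100 dataset score.

    The linear mapping (1 -> 0, 5 -> 100) is an assumption; the README only
    states that 1-5 ratings are normalized to a 100-point scale per dataset.
    """
    return mean((r - 1) / 4 * 100 for r in ratings)

def overall_score(domain_ratings: dict[str, list[int]]) -> float:
    """Average the normalized dataset scores across domains."""
    return mean(dataset_score(r) for r in domain_ratings.values())

# Hypothetical example: two domains with a handful of rated questions each.
print(overall_score({
    "medical": [5, 4, 3, 5],
    "reasoning": [2, 4, 4, 3],
}))
```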

Quick Start & Requirements

  • Access: The project primarily serves as a data and leaderboard repository. No direct installation or execution commands are provided for running benchmarks.
  • Data: Access to the evaluation datasets and model outputs is available via links within the README (e.g., alldata, eval file directory).
  • Resources: Running the evaluations requires substantial compute; exact requirements are not documented, but the scale of the project implies they are significant.

Highlighted Details

  • Benchmarks 232+ commercial and open-source LLMs, with detailed breakdowns by model size and pricing tier.
  • Extensive domain coverage, featuring specialized rankings for medical professions (physician, nursing, pharmacy, etc.), educational levels (primary, secondary, college entrance exams), finance, law, and more.
  • A large "error database" of over 2 million entries for LLM mistakes, facilitating analysis and improvement.
  • Regular updates with new models and expanded evaluation dimensions, indicating active maintenance.

Maintenance & Community

  • The project is actively maintained, with frequent updates listed (e.g., v3.20 on 2025/4/30).
  • A WeChat group is available for community discussion and evaluation exchange.

Licensing & Compatibility

  • The repository is publicly available, but no license is explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

  • The README does not specify the exact methodology for generating the "error database" or the specific datasets used for each sub-dimension.
  • While extensive, the evaluation focuses on specific task types (e.g., multiple-choice, fill-in-the-blank) and may not cover capabilities such as creative writing or complex dialogue.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 511 stars in the last 90 days
