CMMLU by haonan-li

Chinese evaluation benchmark for language models' knowledge and reasoning

created 2 years ago
774 stars

Top 46.0% on sourcepulse

View on GitHub
Project Summary

CMMLU is a comprehensive benchmark designed to evaluate the knowledge and reasoning capabilities of language models specifically within the Chinese language context. It covers 67 diverse subjects, ranging from fundamental sciences to advanced professional fields, including China-specific topics and common sense knowledge. The benchmark is intended for researchers and developers working on or evaluating Chinese language models.

How It Works

CMMLU presents a series of multiple-choice questions, each with four options and a single correct answer. The dataset is structured into development and testing subsets for each of the 67 topics. The evaluation methodology supports both zero-shot and few-shot (specifically five-shot) learning scenarios, allowing for a nuanced assessment of model performance under different prompting conditions.
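
For concreteness, the sketch below shows how a five-shot prompt might be assembled from a subject's dev rows and how a letter choice can be read off a model's reply. The column names (Question, A, B, C, D, Answer), the Chinese instruction wording, and the example subject are assumptions for illustration; the official templates live in src/mp_utils and may differ.

```python
# Minimal sketch of CMMLU-style prompting: build a five-shot prompt from dev
# examples, append a test question, and read off the model's letter choice.
# Column names (Question, A, B, C, D, Answer) and the instruction wording are
# assumptions; the official prompt templates in src/mp_utils may differ.

CHOICES = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    """Render one row as a four-option multiple-choice block."""
    text = f"题目：{row['Question']}\n"
    for letter in CHOICES:
        text += f"{letter}. {row[letter]}\n"
    text += "答案是："
    if include_answer:
        text += f"{row['Answer']}\n\n"
    return text

def build_prompt(dev_rows, test_row, subject="农学", k=5):
    """Five-shot prompt: k solved dev examples, then the unsolved test question."""
    prompt = f"以下是关于{subject}的单项选择题，请直接给出正确答案的选项。\n\n"
    for row in dev_rows[:k]:
        prompt += format_example(row, include_answer=True)
    prompt += format_example(test_row, include_answer=False)
    return prompt

def extract_choice(model_output):
    """Return the first A/B/C/D letter that appears in the model's reply."""
    for ch in model_output:
        if ch in CHOICES:
            return ch
    return None
```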

Quick Start & Requirements

  • The dataset is available on Hugging Face (a minimal loading sketch follows this list).
  • CMMLU is integrated into popular evaluation frameworks such as lm-evaluation-harness and OpenCompass.
  • Pre-processing code for generating direct-answer and chain-of-thought (CoT) prompts is provided in src/mp_utils.
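
A minimal loading sketch, under the assumption that the Hugging Face release is published as haonan-li/cmmlu with one config per subject (e.g. "agronomy") and dev/test splits; verify the exact identifiers on the dataset card. It reuses the build_prompt helper from the sketch above.

```python
# Sketch: load one CMMLU subject from Hugging Face and build a five-shot prompt.
# The dataset id (haonan-li/cmmlu), the per-subject config name ("agronomy"),
# and the split names are assumptions -- check the dataset card before relying
# on them.
from datasets import load_dataset

subject = "agronomy"
cmmlu = load_dataset("haonan-li/cmmlu", subject)  # assumed splits: dev and test

dev_rows = list(cmmlu["dev"])    # small pool of solved examples per subject
test_rows = list(cmmlu["test"])  # the questions that are actually scored

prompt = build_prompt(dev_rows, test_rows[0], subject=subject, k=5)
print(prompt)
```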

Highlighted Details

  • Features 67 subjects covering STEM, humanities, social sciences, and China-specific topics.
  • Includes a leaderboard for tracking model performance in both zero-shot and five-shot settings.
  • Provides example prompts for direct answering and chain-of-thought reasoning.
  • Data is provided in CSV format (a sketch for reading the local CSVs follows this list).
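
If you work from the repository's CSVs rather than Hugging Face, a sketch along these lines applies; the dev/ and test/ folder layout, the per-subject file names, and the column headers are assumptions, so adjust them to the files actually shipped.

```python
# Sketch: read one subject's CSV from a local checkout of the repository.
# The data/dev and data/test layout, file names, and column headers are
# assumptions about how the released data unpacks -- adjust as needed.
import pandas as pd

subject = "agronomy"  # hypothetical file name; one CSV per subject
dev_df = pd.read_csv(f"data/dev/{subject}.csv")
test_df = pd.read_csv(f"data/test/{subject}.csv")

# Each row is a four-option multiple-choice question with a single answer key.
print(test_df[["Question", "A", "B", "C", "D", "Answer"]].head())
print(f"{subject}: {len(dev_df)} dev examples, {len(test_df)} test questions")
```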

Maintenance & Community

  • The project is associated with authors from institutions including Shanghai Jiao Tong University and the University of Melbourne.
  • Results for open models can be submitted via pull request; results for closed models can be emailed for verification and inclusion.

Licensing & Compatibility

  • The CMMLU dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
  • This license restricts commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

  • The benchmark is specifically tailored for Chinese language understanding and may not be suitable for evaluating models in other languages.
  • The "NonCommercial" clause in the license restricts its use in commercial products or services.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days
