CMMLU by haonan-li

Chinese evaluation benchmark for language models' knowledge and reasoning

created 2 years ago
774 stars

Top 46.0% on sourcepulse

View on GitHub
Project Summary

CMMLU is a comprehensive benchmark designed to evaluate the knowledge and reasoning capabilities of language models specifically within the Chinese language context. It covers 67 diverse subjects, ranging from fundamental sciences to advanced professional fields, including China-specific topics and common sense knowledge. The benchmark is intended for researchers and developers working on or evaluating Chinese language models.

How It Works

CMMLU presents a series of multiple-choice questions, each with four options and a single correct answer. The dataset is structured into development and testing subsets for each of the 67 topics. The evaluation methodology supports both zero-shot and few-shot (specifically five-shot) learning scenarios, allowing for a nuanced assessment of model performance under different prompting conditions.
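
For concreteness, the sketch below shows how a five-shot prompt might be assembled from a subject's dev rows and how a letter choice can be read off a model's reply. The column names (Question, A, B, C, D, Answer), the Chinese instruction wording, and the example subject are assumptions for illustration; the official templates live in src/mp_utils and may differ.

```python
# Minimal sketch of CMMLU-style prompting: build a five-shot prompt from dev
# examples, append a test question, and read off the model's letter choice.
# Column names (Question, A, B, C, D, Answer) and the instruction wording are
# assumptions; the official prompt templates in src/mp_utils may differ.

CHOICES = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    """Render one row as a four-option multiple-choice block."""
    text = f"题目：{row['Question']}\n"
    for letter in CHOICES:
        text += f"{letter}. {row[letter]}\n"
    text += "答案是："
    if include_answer:
        text += f"{row['Answer']}\n\n"
    return text

def build_prompt(dev_rows, test_row, subject="农学", k=5):
    """Five-shot prompt: k solved dev examples, then the unsolved test question."""
    prompt = f"以下是关于{subject}的单项选择题，请直接给出正确答案的选项。\n\n"
    for row in dev_rows[:k]:
        prompt += format_example(row, include_answer=True)
    prompt += format_example(test_row, include_answer=False)
    return prompt

def extract_choice(model_output):
    """Return the first A/B/C/D letter that appears in the model's reply."""
    for ch in model_output:
        if ch in CHOICES:
            return ch
    return None
```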

Quick Start & Requirements

  • The dataset is available on Hugging Face (a minimal loading sketch follows this list).
  • CMMLU is integrated into popular evaluation frameworks such as lm-evaluation-harness and OpenCompass.
  • Pre-processing code for generating direct-answer and chain-of-thought (CoT) prompts is provided in src/mp_utils.
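
A minimal loading sketch, under the assumption that the Hugging Face release is published as haonan-li/cmmlu with one config per subject (e.g. "agronomy") and dev/test splits; verify the exact identifiers on the dataset card. It reuses the build_prompt helper from the sketch above.

```python
# Sketch: load one CMMLU subject from Hugging Face and build a five-shot prompt.
# The dataset id (haonan-li/cmmlu), the per-subject config name ("agronomy"),
# and the split names are assumptions -- check the dataset card before relying
# on them.
from datasets import load_dataset

subject = "agronomy"
cmmlu = load_dataset("haonan-li/cmmlu", subject)  # assumed splits: dev and test

dev_rows = list(cmmlu["dev"])    # small pool of solved examples per subject
test_rows = list(cmmlu["test"])  # the questions that are actually scored

prompt = build_prompt(dev_rows, test_rows[0], subject=subject, k=5)
print(prompt)
```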

Highlighted Details

  • Features 67 subjects covering STEM, humanities, social sciences, and China-specific topics.
  • Includes a leaderboard for tracking model performance in both zero-shot and five-shot settings.
  • Provides example prompts for direct answering and chain-of-thought reasoning.
  • Data is provided in CSV format (a sketch for reading the local CSVs follows this list).
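
If you work from the repository's CSVs rather than Hugging Face, a sketch along these lines applies; the dev/ and test/ folder layout, the per-subject file names, and the column headers are assumptions, so adjust them to the files actually shipped.

```python
# Sketch: read one subject's CSV from a local checkout of the repository.
# The data/dev and data/test layout, file names, and column headers are
# assumptions about how the released data unpacks -- adjust as needed.
import pandas as pd

subject = "agronomy"  # hypothetical file name; one CSV per subject
dev_df = pd.read_csv(f"data/dev/{subject}.csv")
test_df = pd.read_csv(f"data/test/{subject}.csv")

# Each row is a four-option multiple-choice question with a single answer key.
print(test_df[["Question", "A", "B", "C", "D", "Answer"]].head())
print(f"{subject}: {len(dev_df)} dev examples, {len(test_df)} test questions")
```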

Maintenance & Community

  • The project is associated with authors from institutions including Shanghai Jiao Tong University and the University of Melbourne.
  • Results for open models can be submitted via pull request; results for closed models can be emailed for verification and inclusion.

Licensing & Compatibility

  • The CMMLU dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
  • This license restricts commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

  • The benchmark is specifically tailored for Chinese language understanding and may not be suitable for evaluating models in other languages.
  • The "NonCommercial" clause in the license restricts its use in commercial products or services.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days
