ceval by hkust-nlp

Chinese eval suite for foundation models (NeurIPS 2023)

created 2 years ago
1,760 stars

Top 24.9% on sourcepulse

View on GitHub
Project Summary

C-Eval is a comprehensive Chinese evaluation suite for assessing the capabilities of foundation models across 52 diverse disciplines. It comprises 13,948 multiple-choice questions spanning four difficulty levels, from middle school to professional exams, and aims to help developers track progress and identify model strengths and weaknesses in Chinese language understanding and reasoning.

How It Works

The suite provides multi-choice questions in various subjects, including STEM, Social Science, Humanities, and Others. It supports both zero-shot and few-shot evaluation methodologies. For few-shot evaluation, a "dev" split with explanations is available to guide models. The "val" split is intended for hyperparameter tuning, while the "test" split's labels are withheld, requiring submission for automatic evaluation.
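The few-shot setup above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual evaluation script: the field names (`question`, `A`–`D`, `answer`) mirror the dataset's column layout but should be treated as assumptions, and the example rows are made-up stand-ins for real dev-split data.

```python
# Sketch: assemble a few-shot, answer-only prompt from dev-split examples.
# Field names and example rows are illustrative assumptions, not the
# repository's official prompt template.

def format_example(row, include_answer=True):
    """Render one multiple-choice question in an answer-only style."""
    text = row["question"] + "\n"
    for choice in ("A", "B", "C", "D"):
        text += f"{choice}. {row[choice]}\n"
    text += "Answer:"
    if include_answer:
        text += f" {row['answer']}\n\n"
    return text

def build_few_shot_prompt(dev_rows, test_row, subject="computer network"):
    """Prepend solved dev examples, then the unanswered test question."""
    header = f"The following are multiple-choice questions about {subject}.\n\n"
    shots = "".join(format_example(r) for r in dev_rows)
    return header + shots + format_example(test_row, include_answer=False)

# Made-up rows for illustration only.
dev_rows = [{"question": "1 + 1 = ?", "A": "1", "B": "2", "C": "3", "D": "4",
             "answer": "B"}]
test_row = {"question": "2 + 2 = ?", "A": "2", "B": "3", "C": "4", "D": "5"}
prompt = build_few_shot_prompt(dev_rows, test_row)
print(prompt)
```

In a real run, `dev_rows` would come from the "dev" split (which also carries explanations for chain-of-thought prompting), and the model's completion after "Answer:" would be parsed into a choice letter.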

Quick Start & Requirements

  • Data Download: Download ceval-exam.zip from Hugging Face or load via datasets.load_dataset("ceval/ceval-exam").
  • Evaluation: Can be performed directly using generated answers or via the lm-evaluation-harness framework (e.g., python main.py --model hf-causal --model_args pretrained=EleutherAI/gpt-j-6B --tasks Ceval-valid-computer_network --device cuda:0).
  • Submission: Requires preparing a UTF-8 encoded JSON file with predicted answers.
  • Resources: Requires standard Python environment with pandas and datasets.
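A minimal sketch of the submission step described above, assuming the expected schema is a JSON object mapping each subject to question-ID→answer pairs; the subject names, IDs, and predicted letters below are illustrative placeholders:

```python
import json

# Sketch: write a UTF-8 encoded submission file. The nested layout
# (subject -> question id -> predicted letter) is an assumed schema;
# the predictions are placeholders, not real model output.
predictions = {
    "computer_network": {"0": "A", "1": "C"},
    "operating_system": {"0": "B"},
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

# Round-trip check that the file parses back to the same structure.
with open("submission.json", encoding="utf-8") as f:
    loaded = json.load(f)
```

`ensure_ascii=False` keeps any Chinese text in the file readable as UTF-8 rather than escaped code points.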

Highlighted Details

  • Includes a "C-Eval Hard" benchmark focusing on challenging STEM subjects requiring complex reasoning.
  • Provides zero-shot and five-shot accuracy benchmarks for various large language models.
  • Offers detailed prompt templates for both answer-only and chain-of-thought evaluations.
  • Accepted to NeurIPS 2023 and integrated into lm-evaluation-harness.

Maintenance & Community

The project is associated with HKUST NLP. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

  • Code: MIT License.
  • Dataset: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license restricts commercial use.

Limitations & Caveats

The dataset license prohibits commercial use. Test set labels are not released, necessitating submission to the platform for evaluation.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 33 stars in the last 90 days
