ceval by hkust-nlp

Chinese eval suite for foundation models (NeurIPS 2023)

created 2 years ago
1,760 stars

Top 24.9% on sourcepulse

View on GitHub
Project Summary

C-Eval is a comprehensive Chinese evaluation suite for assessing the capabilities of foundation models across 52 diverse disciplines. It comprises 13,948 multiple-choice questions spanning four difficulty levels, from middle school to professional exams, and aims to help developers track progress and identify model strengths and weaknesses in Chinese language understanding and reasoning.

How It Works

The suite provides multi-choice questions in various subjects, including STEM, Social Science, Humanities, and Others. It supports both zero-shot and few-shot evaluation methodologies. For few-shot evaluation, a "dev" split with explanations is available to guide models. The "val" split is intended for hyperparameter tuning, while the "test" split's labels are withheld, requiring submission for automatic evaluation.
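The few-shot setup above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual evaluation script: the field names (`question`, `A`–`D`, `answer`) mirror the dataset's column layout but should be treated as assumptions, and the example rows are made-up stand-ins for real dev-split data.

```python
# Sketch: assemble a few-shot, answer-only prompt from dev-split examples.
# Field names and example rows are illustrative assumptions, not the
# repository's official prompt template.

def format_example(row, include_answer=True):
    """Render one multiple-choice question in an answer-only style."""
    text = row["question"] + "\n"
    for choice in ("A", "B", "C", "D"):
        text += f"{choice}. {row[choice]}\n"
    text += "Answer:"
    if include_answer:
        text += f" {row['answer']}\n\n"
    return text

def build_few_shot_prompt(dev_rows, test_row, subject="computer network"):
    """Prepend solved dev examples, then the unanswered test question."""
    header = f"The following are multiple-choice questions about {subject}.\n\n"
    shots = "".join(format_example(r) for r in dev_rows)
    return header + shots + format_example(test_row, include_answer=False)

# Made-up rows for illustration only.
dev_rows = [{"question": "1 + 1 = ?", "A": "1", "B": "2", "C": "3", "D": "4",
             "answer": "B"}]
test_row = {"question": "2 + 2 = ?", "A": "2", "B": "3", "C": "4", "D": "5"}
prompt = build_few_shot_prompt(dev_rows, test_row)
print(prompt)
```

In a real run, `dev_rows` would come from the "dev" split (which also carries explanations for chain-of-thought prompting), and the model's completion after "Answer:" would be parsed into a choice letter.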

Quick Start & Requirements

  • Data Download: Download ceval-exam.zip from Hugging Face or load via datasets.load_dataset("ceval/ceval-exam").
  • Evaluation: Can be performed directly using generated answers or via the lm-evaluation-harness framework (e.g., python main.py --model hf-causal --model_args pretrained=EleutherAI/gpt-j-6B --tasks Ceval-valid-computer_network --device cuda:0).
  • Submission: Requires preparing a UTF-8 encoded JSON file with predicted answers.
  • Resources: Requires standard Python environment with pandas and datasets.
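A minimal sketch of the submission step described above, assuming the expected schema is a JSON object mapping each subject to question-ID→answer pairs; the subject names, IDs, and predicted letters below are illustrative placeholders:

```python
import json

# Sketch: write a UTF-8 encoded submission file. The nested layout
# (subject -> question id -> predicted letter) is an assumed schema;
# the predictions are placeholders, not real model output.
predictions = {
    "computer_network": {"0": "A", "1": "C"},
    "operating_system": {"0": "B"},
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

# Round-trip check that the file parses back to the same structure.
with open("submission.json", encoding="utf-8") as f:
    loaded = json.load(f)
```

`ensure_ascii=False` keeps any Chinese text in the file readable as UTF-8 rather than escaped code points.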

Highlighted Details

  • Includes a "C-Eval Hard" benchmark focusing on challenging STEM subjects requiring complex reasoning.
  • Provides zero-shot and five-shot accuracy benchmarks for various large language models.
  • Offers detailed prompt templates for both answer-only and chain-of-thought evaluations.
  • Accepted to NeurIPS 2023 and integrated into lm-evaluation-harness.

Maintenance & Community

The project is associated with HKUST NLP. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

  • Code: MIT License.
  • Dataset: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license restricts commercial use.

Limitations & Caveats

The dataset license prohibits commercial use. Test set labels are not released, necessitating submission to the platform for evaluation.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 33 stars in the last 90 days
