GAOKAO-Bench by OpenLMLab

Evaluation framework for assessing LLMs using Chinese GAOKAO (college entrance exam) questions

Created 2 years ago

708 stars

Top 48.3% on SourcePulse

1 Expert Loves This Project

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

GAOKAO-Bench is a standardized framework for evaluating large language models on questions from China's Gaokao (National College Entrance Examination). It assesses models' language understanding and logical reasoning on real exam questions, giving the LLM community a common benchmark for comparison.

How It Works

The framework uses a curated dataset of 2811 Gaokao questions from 2010 to 2022: 1781 objective and 1030 subjective. Objective questions are scored by rule-based answer extraction, while subjective questions are scored either by human graders or by an LLM-as-a-Judge approach (specifically GPT-4-turbo). Splitting the evaluation this way covers both questions with a single correct choice and questions that require free-form reasoning.
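To make the idea of rule-based answer extraction concrete, here is a minimal Python sketch. It is not the repository's actual implementation: the pattern, the function names `extract_choice` and `score_objective`, and the all-or-nothing scoring rule are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not taken from the repository):
# matches phrases such as "【答案】B", "答案是 AC", or "The answer is C".
# The trailing lookahead rejects ordinary words such as "Because".
ANSWER_PATTERN = re.compile(
    r"(?:答案|[Aa]nswer)\s*[】\]]?\s*(?:是|为|is)?\s*[:：]?\s*([A-D]+)(?![A-Za-z])"
)


def extract_choice(model_output: str) -> Optional[str]:
    """Return the last stated choice letters, sorted, or None if none is found."""
    matches = ANSWER_PATTERN.findall(model_output)
    if not matches:
        return None
    # Use the last match so a final answer overrides earlier guesses.
    return "".join(sorted(set(matches[-1])))


def score_objective(model_output: str, gold: str, points: float) -> float:
    """Award full points only when the extracted choice equals the gold answer."""
    predicted = extract_choice(model_output)
    return points if predicted == "".join(sorted(set(gold))) else 0.0
```

Under these assumptions, `score_objective("分析略。【答案】B", "B", 5.0)` awards the full 5 points, while an output with no recognizable answer statement scores 0.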

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Obtain an OpenAI API key for model inference and LLM-as-a-Judge grading.
Run objective evaluation: python objective_bench.py --openai_api_key="your openai api key"
Run subjective evaluation: python subjective_bench.py --openai_api_key="your openai api key"
Official documentation and examples are available within the repository.

Highlighted Details

Benchmarks GPT-4, Gemini-Pro, ERNIE-Bot, and others on specific subjects.
Includes a multimodal extension, GAOKAO-MM, for evaluating vision-language models.
Provides scripts for scoring objective and subjective questions, and merging results.
Offers a zero-shot evaluation methodology.

Maintenance & Community

The project is associated with OpenLMLab and includes contributions from researchers at Shanghai CaoYang No.2 High School for subjective question scoring. Further updates are available via GAOKAO-Bench-Updates.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Subjective question grading relies on human evaluators or an LLM-as-a-Judge, so scores for the same answer can vary between graders. The dataset consists of Chinese Gaokao questions, so results may not transfer directly to other educational systems.
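One simple way to quantify grader variability is to compare human and LLM-judge scores on the same answers. The sketch below is illustrative only: the function name `mean_absolute_difference` and the sample scores are hypothetical and do not come from the repository or its results.

```python
from typing import Sequence


def mean_absolute_difference(
    human_scores: Sequence[float], judge_scores: Sequence[float]
) -> float:
    """Average absolute gap between paired human and LLM-judge scores."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("score lists must not be empty")
    gaps = [abs(h - j) for h, j in zip(human_scores, judge_scores)]
    return sum(gaps) / len(gaps)


# Made-up scores for three subjective answers (illustration only).
human = [8.0, 10.0, 6.0]
judge = [7.0, 10.0, 9.0]
gap = mean_absolute_difference(human, judge)
```

A gap near zero would suggest the judge tracks human grading closely on that sample; a larger gap signals that the two grading methods should not be treated as interchangeable.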

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

8 stars in the last 30 days