GAOKAO-Bench by OpenLMLab

Evaluation framework for assessing LLMs using Chinese GAOKAO (college entrance exam) questions

Created 2 years ago
692 stars

Top 49.2% on SourcePulse

Project Summary

GAOKAO-Bench provides a standardized framework for evaluating large language models using China's Gaokao (National College Entrance Examination) questions. It aims to comprehensively assess models' language understanding and logical reasoning capabilities, offering a robust benchmark for the LLM community.

How It Works

The framework leverages a curated dataset of 2811 Gaokao questions (2010-2022), comprising 1781 objective and 1030 subjective questions. Objective questions are evaluated using rule-based answer extraction, while subjective questions are assessed through human grading or an LLM-as-a-Judge approach (specifically using GPT-4-turbo). This dual approach allows for a nuanced evaluation of both factual recall and complex reasoning.
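
As a rough illustration of the objective pipeline, the sketch below shows what rule-based answer extraction can look like for multiple-choice questions. The regex patterns and the function name are illustrative assumptions, not the repository's actual implementation:

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Illustrative rule-based extraction of a multiple-choice answer.

    A simplified sketch, not GAOKAO-Bench's actual logic: it first looks
    for an explicit "answer is X" / "答案是X" statement, then falls back
    to the last standalone choice letter in the output.
    """
    match = re.search(r"(?:answer is|答案是?)\s*([A-D])", model_output, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to the last bare choice letter mentioned in the output.
    letters = re.findall(r"\b[A-D]\b", model_output)
    return letters[-1] if letters else None

print(extract_choice("After elimination, the answer is C."))  # -> C
```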

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Obtain an OpenAI API key for model inference and LLM-as-a-Judge scoring (a minimal judging sketch follows this list).
  • Run objective evaluation: python objective_bench.py --openai_api_key="your openai api key"
  • Run subjective evaluation: python subjective_bench.py --openai_api_key="your openai api key"
  • Official documentation and examples are available within the repository.
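
For subjective grading, a minimal LLM-as-a-Judge call might look like the sketch below. The gpt-4-turbo judge model comes from the README, but the prompt wording and scoring rubric are assumptions; the repository's own scripts define the real ones:

```python
from openai import OpenAI

client = OpenAI(api_key="your openai api key")

def judge_subjective(question: str, reference: str, answer: str, max_score: int) -> str:
    """Hypothetical judging call; the prompt wording is illustrative only."""
    prompt = (
        f"You are grading a Gaokao subjective question worth {max_score} points.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Student answer:\n{answer}\n\n"
        f"Reply with only the awarded score, an integer from 0 to {max_score}."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # judge model named in the README
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # deterministic output reduces grading variance
    )
    return response.choices[0].message.content
```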

Highlighted Details

  • Reports per-subject results for GPT-4, Gemini-Pro, ERNIE-Bot, and other models.
  • Includes a multimodal extension, GAOKAO-MM, for evaluating vision-language models.
  • Provides scripts for scoring objective and subjective questions and merging results (a merge sketch follows this list).
  • Offers a zero-shot evaluation methodology.
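
The result-merging step might work roughly like the sketch below. The per-subject record layout ({"subject": ..., "score": ..., "total": ...}) is an assumption about the intermediate files, not the repository's actual schema:

```python
import json

def merge_scores(objective_path: str, subjective_path: str) -> dict[str, dict[str, float]]:
    """Hypothetical merge of objective and subjective score files into one
    per-subject report; the record layout is assumed, not taken from the repo."""
    merged: dict[str, dict[str, float]] = {}
    for path in (objective_path, subjective_path):
        with open(path, encoding="utf-8") as f:
            for record in json.load(f):
                entry = merged.setdefault(record["subject"], {"score": 0.0, "total": 0.0})
                entry["score"] += record["score"]
                entry["total"] += record["total"]
    return merged

# e.g. merge_scores("objective_scores.json", "subjective_scores.json")
# -> {"Math": {"score": 95.0, "total": 150.0}, ...}
```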

Maintenance & Community

The project is maintained under the OpenLMLab organization and credits teachers at Shanghai CaoYang No.2 High School for grading subjective questions. Further updates are tracked in the companion GAOKAO-Bench-Updates repository.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Subjective-question grading relies on human evaluators or an LLM-as-a-Judge, both of which can introduce scoring variability. The dataset consists entirely of Chinese Gaokao questions, so results may not transfer directly to other educational systems or languages.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
