GAOKAO-Bench by OpenLMLab

Evaluation framework for assessing LLMs using Chinese GAOKAO (college entrance exam) questions

created 2 years ago
672 stars

Top 51.2% on sourcepulse

Project Summary

GAOKAO-Bench is a standardized framework for evaluating large language models on questions from China's Gaokao (National College Entrance Examination). It is designed to assess models' language understanding and logical reasoning and to give the LLM community a shared benchmark for comparing them.

How It Works

The framework uses a curated dataset of 2,811 Gaokao questions from 2010 to 2022: 1,781 objective questions and 1,030 subjective questions. Objective questions are scored by rule-based answer extraction. Subjective questions are graded either by human evaluators or by an LLM-as-a-Judge (GPT-4-turbo). Splitting the evaluation this way covers both questions with a single checkable answer and questions that require extended, free-form reasoning.
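
As an illustration of what rule-based answer extraction can look like for objective questions, here is a minimal Python sketch. It is not the repository's implementation: the answer markers (`【答案】` and `Answer:`), the A-D option range, and the function names `extract_choices` and `score_objective` are all assumptions made for this example.

```python
import re
from typing import List

# Assumed convention (hypothetical): the model marks its final answer with
# "【答案】" or "Answer:" followed by one or more option letters A-D.
ANSWER_PATTERN = re.compile(r"(?:【答案】|Answer:)\s*([A-D](?:\s*[,，、]?\s*[A-D])*)")


def extract_choices(model_output: str) -> List[str]:
    """Return the option letters from the last marked answer, or [] if none is found."""
    matches = ANSWER_PATTERN.findall(model_output)
    if not matches:
        return []
    # Use the last marked answer, since models may restate or revise earlier ones.
    return re.findall(r"[A-D]", matches[-1])


def score_objective(model_output: str, gold: List[str], points: float) -> float:
    """Award full points only when the extracted option set equals the gold set."""
    return points if sorted(extract_choices(model_output)) == sorted(gold) else 0.0
```

A simple pattern like this can misread outputs that do not follow the assumed marker convention, so a real pipeline would need rules tuned to each question type.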

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Obtain an OpenAI API key; it is used for both model inference and LLM-as-a-Judge grading.
  • Run objective evaluation: python objective_bench.py --openai_api_key="your openai api key"
  • Run subjective evaluation: python subjective_bench.py --openai_api_key="your openai api key"
  • Official documentation and examples are available within the repository.
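
To avoid typing the API key on the command line, the commands above could be assembled from an environment variable. The sketch below is one way to do that. The variable name `OPENAI_API_KEY` and the helper `build_command` are assumptions for this example; only the script names and the `--openai_api_key` flag come from the steps above.

```python
import os
from typing import List, Mapping


def build_command(script: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the argument list for one of the benchmark scripts.

    Reads the key from OPENAI_API_KEY (an assumed variable name, not one the
    repository is known to read) and passes it through the documented flag.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return ["python", script, f"--openai_api_key={key}"]
```

The returned list can be passed to `subprocess.run(build_command("objective_bench.py"), check=True)` from the repository root.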

Highlighted Details

  • Benchmarks GPT-4, Gemini-Pro, ERNIE-Bot, and others on specific subjects.
  • Includes a multimodal extension, GAOKAO-MM, for evaluating vision-language models.
  • Provides scripts for scoring objective and subjective questions, and merging results.
  • Offers a zero-shot evaluation methodology.
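
The idea of merging objective and subjective results can be sketched as follows. This is a hypothetical illustration, not the repository's merging script: the `(earned, available)` data layout, the subject key, and the name `merge_scores` are assumptions, and the numbers in the usage note are made up.

```python
from typing import Dict, Tuple

# (points earned, points available) for one subject; an assumed layout.
Score = Tuple[float, float]


def merge_scores(objective: Dict[str, Score], subjective: Dict[str, Score]) -> Dict[str, float]:
    """Combine per-subject objective and subjective scores into a scoring rate in [0, 1]."""
    rates: Dict[str, float] = {}
    for subject in sorted(set(objective) | set(subjective)):
        obj_earned, obj_total = objective.get(subject, (0.0, 0.0))
        sub_earned, sub_total = subjective.get(subject, (0.0, 0.0))
        total = obj_total + sub_total
        if total > 0:  # skip subjects with no available points
            rates[subject] = (obj_earned + sub_earned) / total
    return rates
```

For example, `merge_scores({"math": (30.0, 60.0)}, {"math": (20.0, 40.0)})` gives a scoring rate of 0.5 for `math`, since 50 of 100 available points were earned.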

Maintenance & Community

The project is associated with OpenLMLab and includes contributions from researchers at Shanghai CaoYang No.2 High School for subjective question scoring. Further updates are available via GAOKAO-Bench-Updates.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Subjective question grading relies on human evaluators or LLM-as-a-Judge, which can introduce variability. The dataset is primarily focused on Chinese Gaokao questions, potentially limiting direct applicability to other educational systems.
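
One simple way to quantify the grading variability mentioned above is to score the same answers with both a human grader and the LLM judge and compare the results. The sketch below computes the mean absolute difference between the two. The function name and the choice of metric are assumptions for illustration; the source does not say how agreement is measured.

```python
from statistics import mean
from typing import Sequence


def mean_abs_diff(human: Sequence[float], judge: Sequence[float]) -> float:
    """Mean absolute difference between human and LLM-judge scores for the same answers."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(h - j) for h, j in zip(human, judge))
```

A value near zero suggests the judge tracks human grading closely on that sample; larger values indicate the kind of variability this section warns about.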

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 36 stars in the last 90 days
