Benchmark for evaluating foundation models on human-centric tasks
AGIEval is a human-centric benchmark designed to assess the general cognitive and problem-solving abilities of foundation models. It comprises 20 official, high-standard admission and qualification exams, targeting a broad range of human knowledge and reasoning skills. The benchmark is valuable for researchers and developers seeking to evaluate and compare the performance of large language models on complex, real-world tasks.
How It Works
AGIEval utilizes a curated dataset derived from 20 diverse exams, including college entrance exams (e.g., Gaokao, SAT), professional qualification tests (e.g., law, civil service), and math competitions. The data is structured for evaluation, with multiple-choice questions and cloze tasks. The benchmark facilitates few-shot and zero-shot evaluations, allowing models to be tested with or without prior examples, providing insights into their adaptability and generalization capabilities.
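Each task is distributed as a JSONL file in which every line is one question. Below is a minimal sketch of loading and inspecting a sample; the file path and field names (passage, question, options, label) are assumptions based on the released data layout, not guarantees made by this summary.

```python
import json

# Load one AGIEval task file. The path and field names below are
# assumptions about the released JSONL format; check the repo's data files.
with open("data/v1_1/sat-math.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

sample = examples[0]
print(sample.get("passage"))   # optional context passage (may be null)
print(sample["question"])      # the question stem
print(sample["options"])       # e.g. ["(A) ...", "(B) ...", ...] for MCQs
print(sample["label"])         # gold answer, e.g. "A"
```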
Quick Start & Requirements
To replicate baseline results:
1. Add your OpenAI API key to openai_api.py.
2. Run run_prediction.py to generate model outputs (a minimal sketch of this step appears below).
3. Run post_process_and_evaluation.py for evaluation.
Data can be downloaded from the data/v1_1 directory of the repository.
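The prediction step amounts to sending each formatted question to a chat-completion endpoint and recording the raw response. Here is a zero-shot sketch of that loop, assuming the openai Python client (version 1.0 or later) and the data fields shown earlier; the repository's own run_prediction.py is the authoritative implementation and additionally handles few-shot prompting, batching, and retries.

```python
import json
from openai import OpenAI  # assumes openai>=1.0 is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_zero_shot(example: dict, model: str = "gpt-4o") -> str:
    """Ask the model to answer one multiple-choice question."""
    prompt = (
        (example.get("passage") or "") + "\n"
        + example["question"] + "\n"
        + "\n".join(example["options"])
        + "\nAnswer with the letter of the correct option."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Path is an assumption; see the data note above.
with open("data/v1_1/sat-math.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(predict_zero_shot(examples[0]))
```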
Maintenance & Community
The project is associated with Microsoft and welcomes contributions via pull requests, requiring agreement to a Contributor License Agreement (CLA). It adheres to the Microsoft Open Source Code of Conduct.
Licensing & Compatibility
All data usage must follow the licenses of the original datasets. The repository itself does not specify a license; contributions are governed by the Microsoft CLA, and the licenses of the original exam data may impose additional restrictions on reuse.
Limitations & Caveats
The benchmark focuses on specific exam formats and may not cover all aspects of general intelligence. Replicating the baselines requires API access to commercial models such as GPT-4o, which means supplying API keys and incurring usage costs. Dataset version 1.1 updated some of the Chinese Gaokao subsets and standardized the answer format for multiple-choice questions, so take the data version into account when comparing results.
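Because models often answer in free text, scoring typically extracts a canonical option letter before comparing against the gold label. The sketch below illustrates that post-processing idea; the regex and helper names are illustrative assumptions, and post_process_and_evaluation.py in the repository is the authoritative implementation.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-E) out of a model response."""
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the gold label."""
    hits = sum(
        extract_choice(pred) == gold for pred, gold in zip(predictions, labels)
    )
    return hits / len(labels)

print(accuracy(["The answer is (B).", "C"], ["B", "C"]))  # -> 1.0
```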