AGIEval by ruixiangcui

Benchmark for evaluating foundation models on human-centric tasks

created 2 years ago
758 stars

Top 46.8% on sourcepulse

View on GitHub
Project Summary

AGIEval is a human-centric benchmark designed to assess the general cognitive and problem-solving abilities of foundation models. It comprises 20 official, high-standard admission and qualification exams, targeting a broad range of human knowledge and reasoning skills. The benchmark is valuable for researchers and developers seeking to evaluate and compare the performance of large language models on complex, real-world tasks.

How It Works

AGIEval utilizes a curated dataset derived from 20 diverse exams, including college entrance exams (e.g., Gaokao, SAT), professional qualification tests (e.g., law, civil service), and math competitions. The data is structured for evaluation, with multiple-choice questions and cloze tasks. The benchmark facilitates few-shot and zero-shot evaluations, allowing models to be tested with or without prior examples, providing insights into their adaptability and generalization capabilities.
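
As a rough sketch of the two evaluation modes, the snippet below assembles zero-shot and few-shot prompts from a single record. It is not the repository's code (run_prediction.py implements the real pipeline); the field names question, options, and label follow the data format noted under Highlighted Details, and both helper functions are hypothetical.

```python
# Minimal sketch, not the repository's API. Assumes records shaped like
# {"question": str, "options": [str, ...], "label": "A"}.

def format_item(record: dict, with_answer: bool = False) -> str:
    # Render one question with its (already lettered) options.
    lines = [record["question"], *record["options"], "Answer:"]
    text = "\n".join(lines)
    if with_answer:
        text += " " + record["label"]  # reveal the gold letter for demos
    return text

def build_prompt(target: dict, shots: list[dict] | None = None) -> str:
    # Zero-shot: just the target question. Few-shot: prepend solved
    # examples so the model can infer the expected answer format.
    demos = [format_item(s, with_answer=True) for s in (shots or [])]
    return "\n\n".join(demos + [format_item(target)])
```

Few-shot prompting typically reuses the same demonstrations across a task, so the only part that varies between calls is the target question.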

Quick Start & Requirements

To replicate baseline results:

  1. Download the data from data/v1_1.
  2. Update openai_api.py with your OpenAI API key.
  3. Run run_prediction.py to generate model outputs.
  4. Execute post_process_and_evaluation.py to post-process and score the outputs.
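
For step 4, here is a hedged sketch of the scoring idea: compare one predicted option letter per line against the label field of a JSONL task file. The file paths, the prediction format, and the accuracy helper are assumptions for illustration; post_process_and_evaluation.py performs the actual answer extraction and scoring.

```python
# Illustrative scoring sketch only; assumes a JSONL gold file with a
# "label" field and a text file holding one predicted letter per line.
import json

def accuracy(gold_path: str, pred_path: str) -> float:
    with open(gold_path, encoding="utf-8") as f:
        gold = [json.loads(line)["label"] for line in f]
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.strip() for line in f]
    if len(gold) != len(preds):
        raise ValueError("gold/prediction length mismatch")
    return sum(g == p for g, p in zip(gold, preds)) / len(gold)

# Hypothetical usage:
# print(accuracy("data/v1_1/sat-math.jsonl", "outputs/sat-math.pred"))
```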

Highlighted Details

  • Evaluates models on 20 official exams, including Gaokao, SAT, LSAT, and math competitions.
  • Supports both Chinese (AGIEval-zh) and English (AGIEval-en) datasets.
  • Provides baseline results for GPT-3.5-turbo and GPT-4o, with a public leaderboard.
  • Data format includes questions, options, and correct answers (label/answer fields).
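
For illustration, one record in that shape might look like the following (shown as a Python dict rather than raw JSONL; the exact keys vary by task, and every value here is hypothetical):

```python
# Hypothetical AGIEval-style record; real files store one JSON object
# per line (JSONL).
record = {
    "passage": "Reading passage, when the task provides one.",
    "question": "Which choice best completes the statement?",
    "options": ["(A) first", "(B) second", "(C) third", "(D) fourth"],
    "label": "B",    # gold option letter for multiple-choice tasks
    "answer": None,  # some tasks use an "answer" field instead (assumption)
}
```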

Maintenance & Community

The project is associated with Microsoft and welcomes contributions via pull requests, requiring agreement to a Contributor License Agreement (CLA). It adheres to the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

All data usage must follow the licenses of the original datasets. The repository itself does not specify a license; contributions are governed by a Microsoft Contributor License Agreement, and reuse of the exam data remains subject to whatever restrictions the source datasets impose.

Limitations & Caveats

The benchmark focuses on exam-style tasks and does not cover every aspect of general intelligence. Reproducing the baselines requires API access to models such as GPT-4o, which means supplying an API key and incurring per-request costs. Note that dataset version 1.1 updated several Chinese Gaokao subsets and standardized the answer format for multiple-choice questions, so results may differ from v1.0.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
