MMLU-Pro  by TIGER-AI-Lab

Challenging benchmark for multi-task language understanding (NeurIPS 2024 paper)

created 1 year ago
264 stars

Project Summary

MMLU-Pro is an enhanced benchmark for evaluating large language models' understanding and reasoning capabilities, building upon the original MMLU dataset. It targets researchers and developers seeking a more rigorous assessment of model performance, offering increased difficulty and reduced susceptibility to prompt variations.

How It Works

MMLU-Pro expands the original MMLU by increasing answer choices from four to ten and incorporating more challenging, reasoning-focused questions sourced from academic exams and textbooks. This design aims to significantly raise the benchmark's difficulty, making random guessing less effective and better reflecting expert-level knowledge and complex reasoning. The benchmark's robustness is further demonstrated by its reduced sensitivity to prompt variations compared to MMLU.
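The effect of moving from four to ten answer choices on random guessing can be worked out directly. The snippet below is a minimal sketch that assumes every question has exactly the stated number of options and that guesses are uniform; `chance_accuracy` is a hypothetical helper name, not part of the MMLU-Pro code.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: a fixed option count per question (4 for MMLU, 10 for MMLU-Pro).
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance level:     {mmlu_chance:.0%}")      # 25%
print(f"MMLU-Pro chance level: {mmlu_pro_chance:.0%}")  # 10%
```

Under these assumptions the guessing floor falls from 25% to 10%, so a score just above the floor carries less noise from lucky guesses.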

Quick Start & Requirements

  • Local Inference: cd scripts/examples/ && sh eval_llama_2_7b.sh
  • API Inference: cd scripts/examples/ && sh eval_gpt_4.sh (requires API key modification)
  • Dataset: available on Hugging Face Datasets.
  • Leaderboard: linked from the repository README.

Highlighted Details

  • Comprises over 12,000 questions across 14 diverse domains.
  • Model accuracy is 16-33% lower than on MMLU, reflecting the increased difficulty.
  • Sensitivity of model scores to prompt variations decreased from 4-5% on MMLU to 2% on MMLU-Pro.
  • Chain of Thought (CoT) reasoning outperforms direct answering on MMLU-Pro.
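Scoring CoT output requires pulling the final option letter out of free-form reasoning. The following is a rough sketch of one way to do that, not the repository's actual evaluation code. It assumes the model ends its reasoning with a phrase like "the answer is (X)", that options are lettered A through J, and that unparseable responses count as incorrect. The names `extract_answer` and `accuracy` are hypothetical.

```python
import re
from typing import List, Optional

# Assumed convention: the final answer is stated as "answer is (X)" or
# "answer is X", where X is one of the ten option letters A-J.
ANSWER_PATTERN = re.compile(r"[Aa]nswer is \(?([A-J])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the last option letter the response commits to, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    # A response with no parseable answer yields None and counts as wrong.
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Made-up responses for illustration only.
responses = [
    "Working through the units step by step, the answer is (C).",
    "I think the answer is B, but checking again, the answer is (J).",
    "I am not sure.",
]
gold = ["C", "J", "A"]

print(f"{accuracy(responses, gold):.3f}")  # 0.667
```

Taking the last match rather than the first lets a response revise an earlier guess during its reasoning. Whether unparseable outputs should be scored as wrong or handled some other way is a design choice that affects reported numbers.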

Maintenance & Community

Licensing & Compatibility

  • The repository holds the code and data accompanying the research paper, which is available on arXiv. The README does not explicitly state a license for either the code or the data.

Limitations & Caveats

The README does not specify a license for the code or data, so the terms for commercial use or integration into closed-source projects are unclear. The benchmark is presented as a research artifact from a NeurIPS 2024 paper.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 90 days
