Challenging benchmark for multi-task language understanding (NeurIPS 2024 paper)
MMLU-Pro is an enhanced benchmark for evaluating large language models' understanding and reasoning capabilities, building upon the original MMLU dataset. It targets researchers and developers seeking a more rigorous assessment of model performance, offering increased difficulty and reduced susceptibility to prompt variations.
How It Works
MMLU-Pro expands the original MMLU by increasing answer choices from four to ten and incorporating more challenging, reasoning-focused questions sourced from academic exams and textbooks. This design aims to significantly raise the benchmark's difficulty, making random guessing less effective and better reflecting expert-level knowledge and complex reasoning. The benchmark's robustness is further demonstrated by its reduced sensitivity to prompt variations compared to MMLU.
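To make the ten-option format concrete, here is a minimal Python sketch of loading the public dataset and rendering one question as a prompt. It is an illustration rather than code from this repository, and the dataset id TIGER-Lab/MMLU-Pro, the test split, and the question/options field names are assumptions based on the Hugging Face release.

from datasets import load_dataset

# Ten answer letters replace MMLU's A-D, since each question can carry up to ten options.
LETTERS = "ABCDEFGHIJ"

# Assumption: the public release lives at "TIGER-Lab/MMLU-Pro" with a "test" split
# and per-example fields named "question" and "options".
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def build_prompt(example: dict) -> str:
    # Render one question followed by its lettered answer choices.
    lines = [example["question"]]
    for letter, option in zip(LETTERS, example["options"]):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

print(build_prompt(dataset[0]))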
Quick Start & Requirements
cd scripts/examples/ && sh eval_llama_2_7b.sh
cd scripts/examples/ && sh eval_gpt_4.sh   (requires adding an API key to the script)
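For the API-backed run, the script ultimately sends prompts like the one above to a hosted model and parses a letter out of the reply. The sketch below illustrates that flow only; it is not the repository's evaluation script, and the model name, SDK version, and answer-extraction pattern are assumptions.

import os
import re
from openai import OpenAI

# Assumption: an OPENAI_API_KEY environment variable and the openai>=1.0 Python SDK.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def predict_letter(prompt: str) -> str | None:
    # Send one formatted question to a chat-completions model and pull out a letter A-J.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice, not the benchmark's official setting
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content or ""
    # Illustrative extraction: look for "answer is (X)", falling back to the first bare letter.
    match = re.search(r"answer is \(?([A-J])\)?", text) or re.search(r"\b([A-J])\b", text)
    return match.group(1) if match else None

# Example usage with the build_prompt() helper from the sketch above:
# predict_letter(build_prompt(dataset[0]))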
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify the exact license for the code or data, which may impact commercial use or integration into closed-source projects. The benchmark is presented as a research artifact from a NeurIPS 2024 paper.