MMLU-Pro by TIGER-AI-Lab

Challenging benchmark for multi-task language understanding (NeurIPS 2024 paper)

Created 1 year ago

322 stars

Top 84.5% on SourcePulse

Project Summary

MMLU-Pro is an enhanced benchmark for evaluating large language models' understanding and reasoning capabilities, building upon the original MMLU dataset. It targets researchers and developers seeking a more rigorous assessment of model performance, offering increased difficulty and reduced susceptibility to prompt variations.

How It Works

MMLU-Pro expands the original MMLU by increasing answer choices from four to ten and incorporating more challenging, reasoning-focused questions sourced from academic exams and textbooks. This design aims to significantly raise the benchmark's difficulty, making random guessing less effective and better reflecting expert-level knowledge and complex reasoning. The benchmark's robustness is further demonstrated by its reduced sensitivity to prompt variations compared to MMLU.
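The effect of the larger option set on guessing can be worked out directly: uniform random guessing over k options is expected to score 1/k. The sketch below assumes every question carries the full number of options; the function name is illustrative and not part of the repository.

```python
from fractions import Fraction


def random_guess_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of picking uniformly at random among num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return Fraction(1, num_options)


mmlu_baseline = random_guess_accuracy(4)       # four options, as in MMLU
mmlu_pro_baseline = random_guess_accuracy(10)  # ten options, as in MMLU-Pro

print(f"MMLU chance level:     {float(mmlu_baseline):.0%}")
print(f"MMLU-Pro chance level: {float(mmlu_pro_baseline):.0%}")
```

Under that assumption the chance-level floor falls from one in four to one in ten, so a score near the floor says less about lucky guesses.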

Quick Start & Requirements

Local Inference: cd scripts/examples/ && sh eval_llama_2_7b.sh
API Inference: cd scripts/examples/ && sh eval_gpt_4.sh (edit the script to supply your API key first)
Dataset: available on Hugging Face.
Leaderboard: a public leaderboard is linked from the README.
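The example scripts are not reproduced here, so the following is only a rough sketch of how a ten-option item might be rendered as a lettered prompt before being sent to a model. The function name, the record's field names, and the prompt layout are all assumptions made for illustration, not the repository's actual template or the dataset's confirmed schema.

```python
import string
from typing import Sequence


def format_question(question: str, options: Sequence[str]) -> str:
    """Render a question and up to ten options as a lettered multiple-choice prompt."""
    if not 1 <= len(options) <= 10:
        raise ValueError("expected between 1 and 10 options")
    letters = string.ascii_uppercase[: len(options)]  # "A".."J" for ten options
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letter}. {text}" for letter, text in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical record: the field names here are assumptions for illustration.
example = {
    "question": "Which number is prime?",
    "options": ["4", "6", "8", "9", "10", "12", "13", "14", "15", "16"],
}
prompt = format_question(example["question"], example["options"])
print(prompt)
```

Lettering the options A through J keeps the expected answer to a single character, which makes model output easier to score.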

Highlighted Details

Comprises over 12,000 questions across 14 diverse domains.
Models score 16% to 33% lower on MMLU-Pro than on MMLU, indicating increased difficulty.
Sensitivity of model scores to prompt variations falls from 4-5% on MMLU to 2% on MMLU-Pro.
Models using Chain of Thought (CoT) reasoning outperform direct answering on MMLU-Pro.
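Scoring Chain of Thought output requires pulling a final letter out of free-form reasoning text. The sketch below shows one possible way to do that and to compute accuracy. It assumes the model ends its reasoning with a phrase such as "the answer is (X)"; the regular expression, helper names, and sample responses are hypothetical and may differ from what the repository's scripts do.

```python
import re
from typing import Optional, Sequence

# Assumed answer format: "... answer is (G)" or "... answer is G".
# The lookahead rejects words that merely start with A-J, e.g. "answer is Apple".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter the response commits to, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("responses and gold must be non-empty and equal in length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical model outputs, for illustration only.
responses = [
    "Let's think step by step. 13 has no divisors besides 1 and itself. The answer is (G).",
    "The answer is B.",
    "I am not sure.",
]
gold = ["G", "B", "D"]
print(f"accuracy = {accuracy(responses, gold):.2f}")
```

Taking the last match rather than the first lets a response revise an earlier guess during its reasoning. Responses with no recognizable answer are counted as wrong.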

Maintenance & Community

Paper: available on arXiv (NeurIPS 2024).
Contact: Yubo Wang (y726wang@uwaterloo.ca), Xueguang Ma (x93ma@uwaterloo.ca), Wenhu Chen (wenhuchen@uwaterloo.ca).

Licensing & Compatibility

The repository contains the code and data accompanying a research paper. The README does not state a license for either; the paper itself is available on arXiv.

Limitations & Caveats

Because the README specifies no license for the code or data, the terms for commercial use or integration into closed-source projects are unclear. The benchmark is a research artifact accompanying a NeurIPS 2024 paper.

