MMMU by MMMU-Benchmark

Multimodal benchmark for expert AGI evaluation

created 1 year ago
470 stars

Top 65.7% on sourcepulse

View on GitHub
Project Summary

MMMU is a benchmark suite designed to evaluate multimodal large language models (MLLMs) on college-level subject knowledge and complex reasoning across diverse disciplines. It targets researchers and developers building expert-level Artificial General Intelligence (AGI) systems, offering a rigorous assessment of advanced perception and reasoning capabilities beyond existing benchmarks.

How It Works

MMMU comprises 11.5K multimodal questions from 30 subjects across six disciplines, featuring 32 heterogeneous image types. MMMU-Pro enhances this by filtering out questions that text-only models can answer, augmenting the answer options with plausible distractors, and adding a vision-only input setting that forces simultaneous visual and textual comprehension. This design aims to simulate expert-level cognitive tasks and provide a more robust evaluation of intrinsic multimodal understanding.
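As a rough illustration of the MMMU-Pro construction described above, the following sketch filters out questions that a text-only model already answers correctly and pads the remaining options toward ten candidates. The text_only_answer callable, the distractor pool, and the record fields (question, options, answer) are assumptions made for this sketch, not the project's actual pipeline.

```python
import random
from typing import Callable, Dict, List


def build_pro_subset(
    questions: List[Dict],
    text_only_answer: Callable[[str, List[str]], str],  # hypothetical text-only LLM call
    distractor_pool: List[str],
    num_options: int = 10,
) -> List[Dict]:
    """Sketch of an MMMU-Pro-style construction: drop questions a text-only
    model answers correctly, then augment the options with plausible
    distractors up to num_options candidates."""
    kept = []
    for q in questions:
        # Filtering step: if the model is right without seeing the image,
        # the question does not truly require multimodal understanding.
        if text_only_answer(q["question"], q["options"]) == q["answer"]:
            continue

        # Augmentation step: pad the options with distractors not already present.
        options = list(q["options"])
        candidates = [d for d in distractor_pool if d not in options]
        random.shuffle(candidates)
        options.extend(candidates[: max(0, num_options - len(options))])
        random.shuffle(options)

        kept.append({**q, "options": options})
    return kept
```

In the benchmark itself the extra distractors are generated and human-verified per question rather than drawn from a shared pool; the sketch only shows the shape of the two steps.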

Quick Start & Requirements

  • Evaluation: Code is available in the repository's evaluation folders.
  • Datasets: Available on Hugging Face: MMMU Dataset, MMMU-Pro Dataset (see the loading sketch after this list).
  • Test Set Submission: Via EvalAI.
  • Prerequisites: Python and the libraries listed in the evaluation folders.
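
A minimal loading sketch using the Hugging Face datasets library; the dataset IDs (MMMU/MMMU, MMMU/MMMU_Pro), the per-subject config name, and the field names are assumptions here, so check the dataset cards if they do not match.

```python
from datasets import load_dataset

# One MMMU subject; configs are assumed to follow the subject list
# (e.g. "Accounting") with dev / validation / test splits.
mmmu_val = load_dataset("MMMU/MMMU", "Accounting", split="validation")

example = mmmu_val[0]
print(example["question"])  # question text, possibly referencing interleaved images
print(example["options"])   # candidate answers
print(example["answer"])    # ground truth; withheld on the test split

# MMMU-Pro is published separately (assumed ID below); its configs cover the
# augmented-option and vision-only settings described above.
# mmmu_pro = load_dataset("MMMU/MMMU_Pro", split="test")
```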

Highlighted Details

  • Covers 6 disciplines, 30 subjects, and 183 subfields.
  • Includes 32 heterogeneous image types (charts, diagrams, chemical structures, etc.).
  • MMMU-Pro reduces model accuracy by 16.8%–26.9% relative to MMMU (where GPT-4V scores 56%), highlighting its increased difficulty.
  • Investigates the impact of OCR prompts and Chain of Thought (CoT) reasoning on model performance (see the prompt sketch after this list).
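
For illustration, a sketch of how a direct-answer prompt and a Chain-of-Thought prompt for an MMMU-style multiple-choice question might differ; these templates are assumptions for this summary, and the actual prompts used in the OCR and CoT experiments live in the evaluation folders.

```python
def format_options(options: list[str]) -> str:
    """Render options as lettered choices: (A) ..., (B) ..., etc."""
    return "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))


def direct_prompt(question: str, options: list[str]) -> str:
    # Direct setting: ask only for the option letter.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the option's letter from the given choices directly."
    )


def cot_prompt(question: str, options: list[str]) -> str:
    # Chain-of-Thought setting: ask the model to reason step by step first.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer the preceding multiple-choice question. "
        "Think step by step, then state the final option letter."
    )
```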

Maintenance & Community

  • Active development, with recent updates introducing MMMU-Pro and reporting human expert performance.
  • Contact points provided for inquiries: Xiang Yue, Yu Su, Wenhu Chen.
  • Citations provided for both MMMU and MMMU-Pro papers.

Licensing & Compatibility

  • The README does not explicitly state a license for the evaluation code or datasets.
  • The project emphasizes compliance with copyright and licensing rules from original data sources, with a mechanism for reporting potential infringements.

Limitations & Caveats

The test set answers and explanations are withheld, requiring submission to EvalAI for evaluation. The specific license for the evaluation code and datasets is not clearly stated in the README, which may impact commercial use or integration into closed-source projects.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 53 stars in the last 90 days
