MMMU by MMMU-Benchmark

Multimodal benchmark for expert AGI evaluation

created 1 year ago
470 stars

Top 65.7% on sourcepulse

View on GitHub
Project Summary

MMMU is a benchmark suite designed to evaluate multimodal large language models (MLLMs) on college-level subject knowledge and complex reasoning across diverse disciplines. It targets researchers and developers building expert-level Artificial General Intelligence (AGI) systems, offering a rigorous assessment of advanced perception and reasoning capabilities beyond existing benchmarks.

How It Works

MMMU comprises 11.5K multimodal questions from 30 subjects across six disciplines, featuring 32 heterogeneous image types. MMMU-Pro enhances this by filtering out questions that text-only models can answer, augmenting the answer options with plausible distractors, and adding a vision-only input setting that forces simultaneous visual and textual comprehension. This design aims to simulate expert-level cognitive tasks and provide a more robust evaluation of intrinsic multimodal understanding.
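As a rough illustration of the MMMU-Pro construction described above, the following sketch filters out questions that a text-only model already answers correctly and pads the remaining options toward ten candidates. The text_only_answer callable, the distractor pool, and the record fields (question, options, answer) are assumptions made for this sketch, not the project's actual pipeline.

```python
import random
from typing import Callable, Dict, List


def build_pro_subset(
    questions: List[Dict],
    text_only_answer: Callable[[str, List[str]], str],  # hypothetical text-only LLM call
    distractor_pool: List[str],
    num_options: int = 10,
) -> List[Dict]:
    """Sketch of an MMMU-Pro-style construction: drop questions a text-only
    model answers correctly, then augment the options with plausible
    distractors up to num_options candidates."""
    kept = []
    for q in questions:
        # Filtering step: if the model is right without seeing the image,
        # the question does not truly require multimodal understanding.
        if text_only_answer(q["question"], q["options"]) == q["answer"]:
            continue

        # Augmentation step: pad the options with distractors not already present.
        options = list(q["options"])
        candidates = [d for d in distractor_pool if d not in options]
        random.shuffle(candidates)
        options.extend(candidates[: max(0, num_options - len(options))])
        random.shuffle(options)

        kept.append({**q, "options": options})
    return kept
```

In the benchmark itself the extra distractors are generated and human-verified per question rather than drawn from a shared pool; the sketch only shows the shape of the two steps.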

Quick Start & Requirements

  • Evaluation: Code is available in the repository's evaluation folders.
  • Datasets: Available on Hugging Face: MMMU Dataset, MMMU-Pro Dataset (see the loading sketch after this list).
  • Test Set Submission: Via EvalAI.
  • Prerequisites: Python and the libraries listed in the evaluation folders.
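
A minimal loading sketch using the Hugging Face datasets library; the dataset IDs (MMMU/MMMU, MMMU/MMMU_Pro), the per-subject config name, and the field names are assumptions here, so check the dataset cards if they do not match.

```python
from datasets import load_dataset

# One MMMU subject; configs are assumed to follow the subject list
# (e.g. "Accounting") with dev / validation / test splits.
mmmu_val = load_dataset("MMMU/MMMU", "Accounting", split="validation")

example = mmmu_val[0]
print(example["question"])  # question text, possibly referencing interleaved images
print(example["options"])   # candidate answers
print(example["answer"])    # ground truth; withheld on the test split

# MMMU-Pro is published separately (assumed ID below); its configs cover the
# augmented-option and vision-only settings described above.
# mmmu_pro = load_dataset("MMMU/MMMU_Pro", split="test")
```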

Highlighted Details

  • Covers 6 disciplines, 30 subjects, and 183 subfields.
  • Includes 32 heterogeneous image types (charts, diagrams, chemical structures, etc.).
  • MMMU-Pro reduces model accuracy by 16.8%–26.9% relative to MMMU (where GPT-4V scores 56%), highlighting its increased difficulty.
  • Investigates the impact of OCR prompts and Chain of Thought (CoT) reasoning on model performance (see the prompt sketch after this list).
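
For illustration, a sketch of how a direct-answer prompt and a Chain-of-Thought prompt for an MMMU-style multiple-choice question might differ; these templates are assumptions for this summary, and the actual prompts used in the OCR and CoT experiments live in the evaluation folders.

```python
def format_options(options: list[str]) -> str:
    """Render options as lettered choices: (A) ..., (B) ..., etc."""
    return "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))


def direct_prompt(question: str, options: list[str]) -> str:
    # Direct setting: ask only for the option letter.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the option's letter from the given choices directly."
    )


def cot_prompt(question: str, options: list[str]) -> str:
    # Chain-of-Thought setting: ask the model to reason step by step first.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer the preceding multiple-choice question. "
        "Think step by step, then state the final option letter."
    )
```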

Maintenance & Community

  • Active development, with recent updates introducing MMMU-Pro and reporting human expert performance.
  • Contact points provided for inquiries: Xiang Yue, Yu Su, Wenhu Chen.
  • Citations provided for both MMMU and MMMU-Pro papers.

Licensing & Compatibility

  • The README does not explicitly state a license for the evaluation code or datasets.
  • The project emphasizes compliance with copyright and licensing rules from original data sources, with a mechanism for reporting potential infringements.

Limitations & Caveats

The test set answers and explanations are withheld, requiring submission to EvalAI for evaluation. The specific license for the evaluation code and datasets is not clearly stated in the README, which may impact commercial use or integration into closed-source projects.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 53 stars in the last 90 days
