MM-Vet by yuweihao

Benchmark for evaluating large multimodal models (LMMs)

created 2 years ago
306 stars

Top 88.7% on sourcepulse

Project Summary

MM-Vet provides a comprehensive benchmark for evaluating Large Multimodal Models (LMMs) by assessing their integrated capabilities across various tasks. It is designed for researchers and developers working on LMMs, offering a standardized framework to measure performance beyond single-task evaluations and identify areas for improvement in models aiming for general-purpose multimodal understanding.

How It Works

MM-Vet evaluates LMMs on a diverse set of tasks that require the integration of multiple core vision-language capabilities, including recognition, OCR, knowledge retrieval, language generation, spatial awareness, and mathematical reasoning. Unlike traditional benchmarks that focus on isolated skills, MM-Vet's methodology emphasizes the synergistic application of these abilities, providing a more holistic assessment of an LMM's real-world utility and integrated intelligence.

Quick Start & Requirements

  • Install the openai package: pip install "openai>=1" (the quotes stop the shell from treating > as redirection).
  • Obtain API access for GPT-4/GPT-3.5.
  • Download the MM-Vet dataset.
  • Inference scripts are provided for models like GPT-4V and Gemini.
  • Evaluation is performed using an LLM-based evaluator script or an online Hugging Face Space.
  • See the official quick-start guide for detailed steps.
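The LLM-based evaluation step above can be sketched as follows. This is a minimal, hypothetical illustration of GPT-4-as-grader in the MM-Vet style; the prompt wording and helper names are assumptions, and the official evaluator script differs in detail:

```python
# Hypothetical sketch of MM-Vet-style LLM grading: build a grading prompt,
# then parse a 0.0-1.0 score out of the grader model's reply.
import re


def build_grading_prompt(question: str, gold: str, prediction: str) -> str:
    """Compose the text prompt sent to the GPT-4 grader (assumed format)."""
    return (
        "Compare the model answer with the ground truth and reply with a "
        "single score between 0.0 and 1.0.\n"
        f"Question: {question}\n"
        f"Ground truth: {gold}\n"
        f"Model answer: {prediction}\n"
        "Score:"
    )


def parse_score(reply: str) -> float:
    """Extract the first numeric value from the grader's reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)


# With the openai>=1 client, the prompt would be sent along these lines:
#   resp = client.chat.completions.create(
#       model="gpt-4",
#       messages=[{"role": "user", "content": prompt}],
#       temperature=0,
#   )
#   score = parse_score(resp.choices[0].message.content)
```

Clamping the parsed value guards against a grader reply that drifts outside the requested range; averaging per-sample scores would then give a benchmark total.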

Highlighted Details

  • MM-Vet v2 includes "image-text sequence understanding" and an expanded evaluation set.
  • GPT-4V achieved the top score of 67.7% on MM-Vet, with Qwen-VL-Max close behind at 66.6% and Gemini Pro Vision at 64.3%.
  • A public leaderboard is available for tracking model performance.

Maintenance & Community

  • The project is associated with ICML 2024.
  • Updates include new model inference scripts and dataset extensions.
  • A leaderboard is hosted on PapersWithCode.

Licensing & Compatibility

  • Code is licensed under Apache 2.0.
  • The dataset is licensed under CC BY-NC 4.0, which may restrict commercial use.

Limitations & Caveats

The evaluation relies on GPT-4 for grading, which can be a bottleneck and introduces potential biases or limitations inherent to the grading model. The CC BY-NC 4.0 license for the dataset restricts commercial applications.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 10 stars in the last 90 days
