MM-Vet by yuweihao

Benchmark for evaluating large multimodal models (LMMs)

created 2 years ago
306 stars

Top 88.7% on sourcepulse

Project Summary

MM-Vet provides a comprehensive benchmark for evaluating Large Multimodal Models (LMMs) by assessing their integrated capabilities across various tasks. It is designed for researchers and developers working on LMMs, offering a standardized framework to measure performance beyond single-task evaluations and identify areas for improvement in models aiming for general-purpose multimodal understanding.

How It Works

MM-Vet evaluates LMMs on a diverse set of tasks that require the integration of multiple core vision-language capabilities, including recognition, OCR, knowledge retrieval, language generation, spatial awareness, and mathematical reasoning. Unlike traditional benchmarks that focus on isolated skills, MM-Vet's methodology emphasizes the synergistic application of these abilities, providing a more holistic assessment of an LMM's real-world utility and integrated intelligence.

Quick Start & Requirements

  • Install the openai package: pip install "openai>=1" (the quotes stop the shell from treating > as redirection).
  • Obtain API access for GPT-4/GPT-3.5.
  • Download the MM-Vet dataset.
  • Inference scripts are provided for models like GPT-4V and Gemini.
  • Evaluation is performed using an LLM-based evaluator script or an online Hugging Face Space.
  • See the official quick-start guide for detailed steps.
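The LLM-based evaluation step above can be sketched as follows. This is a minimal, hypothetical illustration of GPT-4-as-grader in the MM-Vet style; the prompt wording and helper names are assumptions, and the official evaluator script differs in detail:

```python
# Hypothetical sketch of MM-Vet-style LLM grading: build a grading prompt,
# then parse a 0.0-1.0 score out of the grader model's reply.
import re


def build_grading_prompt(question: str, gold: str, prediction: str) -> str:
    """Compose the text prompt sent to the GPT-4 grader (assumed format)."""
    return (
        "Compare the model answer with the ground truth and reply with a "
        "single score between 0.0 and 1.0.\n"
        f"Question: {question}\n"
        f"Ground truth: {gold}\n"
        f"Model answer: {prediction}\n"
        "Score:"
    )


def parse_score(reply: str) -> float:
    """Extract the first numeric value from the grader's reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)


# With the openai>=1 client, the prompt would be sent along these lines:
#   resp = client.chat.completions.create(
#       model="gpt-4",
#       messages=[{"role": "user", "content": prompt}],
#       temperature=0,
#   )
#   score = parse_score(resp.choices[0].message.content)
```

Clamping the parsed value guards against a grader reply that drifts outside the requested range; averaging per-sample scores would then give a benchmark total.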

Highlighted Details

  • MM-Vet v2 includes "image-text sequence understanding" and an expanded evaluation set.
  • GPT-4V achieved the top score of 67.7% on MM-Vet, with Qwen-VL-Max close behind at 66.6% and Gemini Pro Vision at 64.3%.
  • A public leaderboard is available for tracking model performance.

Maintenance & Community

  • The project is associated with ICML 2024.
  • Updates include new model inference scripts and dataset extensions.
  • A leaderboard is hosted on PapersWithCode.

Licensing & Compatibility

  • Code is licensed under Apache 2.0.
  • The dataset is licensed under CC BY-NC 4.0, which may restrict commercial use.

Limitations & Caveats

The evaluation relies on GPT-4 for grading, which can be a bottleneck and introduces potential biases or limitations inherent to the grading model. The CC BY-NC 4.0 license for the dataset restricts commercial applications.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 10 stars in the last 90 days
