MMVP: A benchmark for evaluating the visual capabilities of multimodal LLMs
This repository provides the MMVP benchmark and evaluation scripts for assessing the visual capabilities of multimodal large language models (MLLMs). It targets researchers and developers working on vision-language models, offering a standardized way to identify and quantify model shortcomings in visual understanding, particularly in tasks that require spatial reasoning and detailed visual grounding.
How It Works
The MMVP benchmark consists of 300 images with associated questions designed to probe specific visual reasoning abilities. The evaluation scripts compare model outputs against ground-truth answers and generate performance metrics. A companion MMVP-VLM benchmark simplifies these questions for evaluating vision-language models (VLMs) such as CLIP, grouping them into nine visual patterns so that performance can be analyzed across different types of visual reasoning.
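To make the scoring step concrete, the sketch below computes pair-level accuracy from a CSV of predictions. The file name and column names (prediction, ground_truth) are hypothetical rather than the repository's exact output format, and the rule that both questions of an image pair must be answered correctly follows the convention commonly used for MMVP.

```python
import csv

def mmvp_pair_accuracy(answers_csv: str = "answers.csv") -> float:
    """Pair-level accuracy for MMVP-style predictions.

    Assumes a CSV with hypothetical columns `prediction` and `ground_truth`
    (multiple-choice letters), ordered so that consecutive rows (1,2), (3,4), ...
    belong to the same image pair. A pair counts as correct only when both of
    its questions are answered correctly.
    """
    with open(answers_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    correct, total = 0, 0
    for i in range(0, len(rows) - 1, 2):
        pair = rows[i : i + 2]
        total += 1
        if all(r["prediction"].strip().lower() == r["ground_truth"].strip().lower() for r in pair):
            correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print(f"MMVP pair accuracy: {mmvp_pair_accuracy():.1%}")
```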
Quick Start & Requirements
Create a conda environment (conda create -n mmvp python=3.10 -y, then conda activate mmvp), navigate to the LLaVA directory, and install dependencies (pip install -e ., then pip install flash-attn --no-build-isolation).
Highlighted Details
The repository ships evaluation scripts for MLLMs and VLMs (evaluate_mllm.py, evaluate_vlm.py) and an LLM-based grading script (gpt_grader.py).
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The benchmark is designed to expose visual shortcomings, so models that perform well on standard benchmarks may still struggle on it. The gpt_grader.py script requires an OpenAI API key, and grading runs may incur API costs.
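Because grading calls are billed, a small guard like the following can fail fast when the key is missing; this is a sketch, relying only on OPENAI_API_KEY being the standard variable read by the OpenAI Python client.

```python
import os
import sys

# Abort before any paid grading calls are made if the API key is not set.
if not os.environ.get("OPENAI_API_KEY"):
    sys.exit("Set OPENAI_API_KEY before running gpt_grader.py")
```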