MMVP: A benchmark for evaluating the visual capabilities of multimodal LLMs
This repository provides the MMVP benchmark and evaluation scripts for assessing the visual capabilities of multimodal large language models (MLLMs). It targets researchers and developers working on vision-language models, offering a standardized way to identify and quantify model shortcomings in visual understanding, particularly in tasks that require spatial reasoning and detailed visual grounding.
How It Works
The MMVP benchmark consists of 300 images with associated questions designed to probe specific visual reasoning abilities. The evaluation scripts compare model outputs against ground-truth answers and generate performance metrics. A companion MMVP-VLM benchmark simplifies these questions for evaluating vision-language models (VLMs) such as CLIP, grouping them into nine visual patterns so that performance can be analyzed across different types of visual reasoning.
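To make the scoring step concrete, the sketch below computes pair-level accuracy from a CSV of predictions. The file name and column names (prediction, ground_truth) are hypothetical rather than the repository's exact output format, and the rule that both questions of an image pair must be answered correctly follows the convention commonly used for MMVP.

```python
import csv

def mmvp_pair_accuracy(answers_csv: str = "answers.csv") -> float:
    """Pair-level accuracy for MMVP-style predictions.

    Assumes a CSV with hypothetical columns `prediction` and `ground_truth`
    (multiple-choice letters), ordered so that consecutive rows (1,2), (3,4), ...
    belong to the same image pair. A pair counts as correct only when both of
    its questions are answered correctly.
    """
    with open(answers_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    correct, total = 0, 0
    for i in range(0, len(rows) - 1, 2):
        pair = rows[i : i + 2]
        total += 1
        if all(r["prediction"].strip().lower() == r["ground_truth"].strip().lower() for r in pair):
            correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print(f"MMVP pair accuracy: {mmvp_pair_accuracy():.1%}")
```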
Quick Start & Requirements
Create a conda environment (conda create -n mmvp python=3.10 -y, then conda activate mmvp), navigate to the LLaVA directory, and install dependencies (pip install -e ., then pip install flash-attn --no-build-isolation).
Highlighted Details
The repository ships evaluation scripts for MLLMs and VLMs (evaluate_mllm.py, evaluate_vlm.py) and an LLM-based grading script (gpt_grader.py).
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The benchmark is designed to expose visual shortcomings, so models that perform well on standard benchmarks may still struggle on it. The gpt_grader.py script requires an OpenAI API key, and grading runs may incur API costs.
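Because grading calls are billed, a small guard like the following can fail fast when the key is missing; this is a sketch, relying only on OPENAI_API_KEY being the standard variable read by the OpenAI Python client.

```python
import os
import sys

# Abort before any paid grading calls are made if the API key is not set.
if not os.environ.get("OPENAI_API_KEY"):
    sys.exit("Set OPENAI_API_KEY before running gpt_grader.py")
```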