MMVP by tsb0601

Benchmark for multimodal LLM visual capability evaluation

Created 1 year ago
352 stars

Top 79.1% on SourcePulse

Project Summary

This repository provides the MMVP benchmark and evaluation scripts to assess the visual capabilities of multimodal large language models (MLLMs). It targets researchers and developers working on vision-language models, offering a standardized way to identify and quantify their shortcomings in visual understanding, particularly in tasks requiring spatial reasoning and detailed visual grounding.

How It Works

The MMVP benchmark consists of 300 images with associated questions designed to probe specific visual reasoning abilities. The evaluation scripts compare model outputs against ground-truth answers and report performance metrics. A companion MMVP-VLM benchmark distills these tasks into a simpler form for evaluating vision-language models (VLMs) such as CLIP, grouping the questions into nine visual patterns so that performance trends can be analyzed across visual reasoning types.
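As described in the paper, the 300 images are organized into pairs, and a model receives credit only when it answers both questions in a pair correctly. Below is a minimal sketch of that pair-wise scoring; the field names (pair_id, answer, truth) are hypothetical stand-ins, not the columns actually used by evaluate_mllm.py:

    # Pair-wise scoring sketch; field names are illustrative, not
    # the format used by the repository's evaluation scripts.
    from collections import defaultdict

    def pair_accuracy(records):
        """records: iterable of dicts with 'pair_id', 'answer', 'truth'."""
        pairs = defaultdict(list)
        for r in records:
            pairs[r["pair_id"]].append(
                r["answer"].strip().lower() == r["truth"].strip().lower()
            )
        # A pair counts only if both of its questions were answered correctly.
        correct = sum(1 for hits in pairs.values() if all(hits))
        return correct / len(pairs)

    demo = [
        {"pair_id": 1, "answer": "(a)", "truth": "(a)"},
        {"pair_id": 1, "answer": "(b)", "truth": "(a)"},  # one miss sinks the pair
    ]
    print(pair_accuracy(demo))  # 0.0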

Quick Start & Requirements

  • Installation: Use Conda to create an environment (conda create -n mmvp python=3.10 -y, conda activate mmvp), navigate to the LLaVA directory, and install dependencies (pip install -e ., pip install flash-attn --no-build-isolation).
  • Prerequisites: Python 3.10, Conda, LLaVA framework, flash-attn.
  • Resources: Requires downloading the MMVP benchmark dataset (300 images plus a question CSV) and, depending on the evaluation, pre-trained model weights; see the loading sketch after this list.
  • Links: Paper, Project Page, MMVP Benchmark, MMVP-VLM Benchmark.
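Once the environment is set up, loading the benchmark amounts to reading the CSV and resolving image paths. A minimal sketch, assuming the dataset unpacks to an MMVP/ directory containing Questions.csv and an images/ folder; both names, and the image-naming scheme, are assumptions about the archive layout:

    # Hypothetical MMVP dataset loader; paths, file names, and the
    # column layout are assumptions, not the documented archive format.
    import os
    import pandas as pd

    df = pd.read_csv("MMVP/Questions.csv")   # assumed question file
    print(f"{len(df)} questions loaded")

    for _, row in df.head(3).iterrows():
        # Assume the first column indexes the image (e.g. "1" -> images/1.jpg).
        img = os.path.join("MMVP", "images", f"{row.iloc[0]}.jpg")
        print(img, "found" if os.path.exists(img) else "missing")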

Highlighted Details

  • Evaluates visual grounding and spatial reasoning in MLLMs.
  • Includes MMVP (300 images, VQA) and MMVP-VLM (simplified, 9 visual patterns) benchmarks.
  • Provides evaluation scripts (evaluate_mllm.py, evaluate_vlm.py) and an LLM-based grading script (gpt_grader.py); a grader sketch follows this list.
  • Demonstrates significant shortcomings in state-of-the-art MLLMs and limited gains from scaling CLIP models.
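The LLM-based grader judges whether a model's free-form answer matches the ground truth by asking another model to decide. A minimal sketch in the spirit of gpt_grader.py, using the current openai Python client; the prompt, judge model, and output format here are assumptions, not the script's actual ones:

    # Hypothetical LLM grader; requires OPENAI_API_KEY in the environment
    # and incurs API costs. Prompt and model choice are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def grade(question: str, truth: str, answer: str) -> bool:
        prompt = (
            f"Question: {question}\n"
            f"Ground-truth answer: {truth}\n"
            f"Model answer: {answer}\n"
            "Reply with exactly one word: correct or incorrect."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",             # assumed judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower() == "correct"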

Maintenance & Community

  • Built upon the LLaVA project.
  • Key contributors include Shengbang Tong, Zhuang Liu, Yi Ma, Yann LeCun, and Saining Xie.
  • Citation information provided via BibTeX.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The benchmark is designed to expose visual shortcomings, so models that perform well on standard benchmarks may still score poorly on it. The gpt_grader.py script requires an OpenAI API key and incurs per-request API costs.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Zack Li (Cofounder of Nexa AI), and 19 more.

LLaVA by haotian-liu
Top 0.2% on SourcePulse · 24k stars
Multimodal assistant with GPT-4 level capabilities
Created 2 years ago · Updated 1 year ago