MMVP by tsb0601

Benchmark for multimodal LLM visual capability evaluation

created 1 year ago
344 stars

Top 81.6% on sourcepulse

Project Summary

This repository provides the MMVP benchmark and evaluation scripts to assess the visual capabilities of multimodal large language models (MLLMs). It targets researchers and developers working on vision-language models, offering a standardized way to identify and quantify their shortcomings in visual understanding, particularly in tasks requiring spatial reasoning and detailed visual grounding.

How It Works

The MMVP benchmark consists of 300 images with associated questions designed to probe specific visual reasoning abilities. The evaluation scripts compare model outputs against ground-truth answers and report performance metrics. A companion MMVP-VLM benchmark simplifies the questions for evaluating vision-language models (VLMs) such as CLIP, grouping them into nine visual patterns so that performance trends can be analyzed across different types of visual reasoning.
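
To make the scoring concrete, here is a minimal sketch of pair-based accuracy computation. It assumes predictions and ground-truth answers are available as parallel lists and, following the MMVP paper, that the 300 questions are ordered in pairs and a pair is credited only when both of its questions are answered correctly; this pairing convention and the exact string matching are assumptions, not the repository's evaluation code.

```python
# Minimal sketch of pair-based MMVP scoring (not the repository's exact
# evaluation script). Assumes `preds` and `answers` are parallel lists of
# answer strings, ordered so consecutive entries (0,1), (2,3), ... belong
# to the same question pair -- an assumption drawn from the MMVP paper.

def mmvp_pair_accuracy(preds: list[str], answers: list[str]) -> float:
    """Credit a pair only when both of its questions are answered correctly."""
    assert len(preds) == len(answers) and len(preds) % 2 == 0
    correct_pairs = 0
    for i in range(0, len(preds), 2):
        first_ok = preds[i].strip().lower() == answers[i].strip().lower()
        second_ok = preds[i + 1].strip().lower() == answers[i + 1].strip().lower()
        if first_ok and second_ok:
            correct_pairs += 1
    return correct_pairs / (len(preds) // 2)
```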

Quick Start & Requirements

  • Installation: Use Conda to create an environment (conda create -n mmvp python=3.10 -y, conda activate mmvp), navigate to the LLaVA directory, and install dependencies (pip install -e ., pip install flash-attn --no-build-isolation).
  • Prerequisites: Python 3.10, Conda, LLaVA framework, flash-attn.
  • Resources: Requires downloading the MMVP benchmark dataset (300 images + CSV) and potentially pre-trained models; see the data-loading sketch after this list.
  • Links: Paper, Project Page, MMVP Benchmark, MMVP-VLM Benchmark.
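
The snippet below is a hedged sketch of loading the downloaded benchmark locally. The file name Questions.csv, the column names, and the image naming scheme are illustrative assumptions; check the downloaded files for the actual layout.

```python
# Hypothetical loader for the downloaded MMVP data. The file name
# "Questions.csv", the column names, and the "MMVP Images/<n>.jpg" naming
# scheme are assumptions -- adjust them to match the actual download.
from pathlib import Path

import pandas as pd          # pip install pandas
from PIL import Image        # pip install pillow

def load_mmvp(root: str) -> list[dict]:
    root = Path(root)
    questions = pd.read_csv(root / "Questions.csv")        # assumed file name
    samples = []
    for idx, row in questions.iterrows():
        image_path = root / "MMVP Images" / f"{idx + 1}.jpg"  # assumed naming
        samples.append({
            "image": Image.open(image_path).convert("RGB"),
            "question": row["Question"],                    # assumed column
            "answer": row["Correct Answer"],                # assumed column
        })
    return samples
```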

Highlighted Details

  • Evaluates visual grounding and spatial reasoning in MLLMs.
  • Includes MMVP (300 images, VQA) and MMVP-VLM (simplified, 9 visual patterns) benchmarks.
  • Provides evaluation scripts (evaluate_mllm.py, evaluate_vlm.py) and an LLM-based grading script (gpt_grader.py).
  • Demonstrates significant shortcomings in state-of-the-art MLLMs and limited gains from scaling CLIP models.

Maintenance & Community

  • Built upon the LLaVA project.
  • Key contributors include Shengbang Tong, Zhuang Liu, Yi Ma, Yann LeCun, and Saining Xie.
  • Citation information provided via BibTeX.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The benchmark is designed to highlight visual shortcomings; models performing well on standard benchmarks may still struggle. The gpt_grader.py script requires an OpenAI API key and may incur costs.
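
For orientation, here is a minimal sketch of the general LLM-grading pattern such a script follows: send the question, the ground-truth answer, and the model's response to an OpenAI chat model and ask for a correct/incorrect verdict. The prompt wording and model name are illustrative assumptions, not the contents of the repository's gpt_grader.py.

```python
# Illustrative LLM-based grading call (not the repository's gpt_grader.py).
# Requires OPENAI_API_KEY in the environment; each call is billed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(question: str, ground_truth: str, model_answer: str) -> bool:
    prompt = (
        "You are grading a visual question answering response.\n"
        f"Question: {question}\n"
        f"Correct answer: {ground_truth}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```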

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days
