thinking-in-space by vision-x-nyu

VQA benchmark for evaluating spatial reasoning in MLLMs

created 7 months ago
555 stars

Top 58.6% on sourcepulse

Project Summary

This repository provides VSI-Bench, a benchmark and evaluation framework for assessing the visual-spatial intelligence of Multimodal Large Language Models (MLLMs). It addresses the gap in understanding how MLLMs perceive, remember, and recall spatial information from video, offering a resource for researchers and developers in AI and robotics.

How It Works

VSI-Bench comprises over 5,000 question-answer pairs derived from 288 egocentric videos of real indoor 3D scenes. Questions fall into three task types: configurational, measurement estimation, and spatiotemporal. Multiple-choice answers are scored with accuracy, and numerical answers with Mean Relative Accuracy (MRA). The accompanying analysis probes whether MLLMs build implicit "cognitive maps" of the environments they observe.
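For illustration, the sketch below shows how Mean Relative Accuracy can be computed for a single numerical answer. It assumes the threshold sweep {0.50, 0.55, ..., 0.95} described in the paper; the function name and exact thresholds are assumptions for this sketch, not code copied from the repository.

```python
import numpy as np

def mean_relative_accuracy(pred: float, target: float,
                           thresholds=np.arange(0.50, 1.00, 0.05)) -> float:
    """Average a hit/miss test over a sweep of relative-error thresholds:
    stricter thresholds only credit predictions closer to the target."""
    rel_err = abs(pred - target) / abs(target)
    return float(np.mean([rel_err < (1.0 - theta) for theta in thresholds]))

# Example: predicting 4.2 m for a true room length of 5.0 m gives partial
# credit (relative error 0.16 passes 7 of the 10 thresholds).
print(mean_relative_accuracy(4.2, 5.0))  # 0.7
```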

Quick Start & Requirements

  • Installation: Requires Python 3.10, Git, and Conda. Install by cloning the repo, initializing submodules, and installing dependencies via pip install -e . and specific packages like deepspeed and s2wrapper.
  • Benchmark Access: Load the dataset with datasets.load_dataset("nyu-visionx/VSI-Bench") (see the snippet after this list).
  • Evaluation: Run bash evaluate_all_in-one.sh --model all --num_processes 8 --benchmark vsibench.
  • Resources: No specific hardware requirements like GPUs are mentioned for benchmark access or evaluation setup, but model inference will likely require significant computational resources.
  • Links: VSI-Bench on Hugging Face: https://huggingface.co/datasets/nyu-visionx/VSI-Bench
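For reference, here is a minimal sketch of loading and inspecting the benchmark with the Hugging Face datasets library; the split name and field names in the comments are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Download VSI-Bench from the Hugging Face Hub.
bench = load_dataset("nyu-visionx/VSI-Bench")

print(bench)               # available splits and their sizes
sample = bench["test"][0]  # "test" split assumed; check the dataset card
print(sample.keys())       # e.g. question, options, ground truth, video id
```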

Highlighted Details

  • Benchmarks 15 video-supporting MLLMs, including proprietary models such as Gemini-1.5 and GPT-4o and open-source models from the InternVL2, VILA, LongVILA, LongVA, LLaVA-OneVision, and LLaVA-NeXT-Video families.
  • Evaluation is performed in zero-shot settings with default prompts and greedy decoding for reproducibility (a minimal greedy-decoding sketch follows this list).
  • The benchmark is built upon the lmms-eval toolkit.
  • Paper accepted to CVPR 2025.
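As an illustration of the zero-shot, greedy-decoding protocol (not the repository's actual evaluation code), a Hugging Face transformers generation config with sampling disabled looks like this; model loading and prompt construction are omitted.

```python
from transformers import GenerationConfig

# Greedy decoding: no sampling and no beam search, so repeated runs
# produce identical outputs for the same model and prompt.
greedy = GenerationConfig(
    do_sample=False,    # disable temperature / top-p sampling
    num_beams=1,        # plain greedy search
    max_new_tokens=32,  # answers are short: a choice letter or a number
)

# Hypothetical usage: outputs = model.generate(**inputs, generation_config=greedy)
```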

Maintenance & Community

  • The project is associated with New York University, Yale University, and Stanford University.
  • Feedback is encouraged for benchmark imperfections.
  • Citation details are provided for the paper "Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces".

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying datasets (ScanNet, ScanNet++, ARKitScenes) may have their own licenses.

Limitations & Caveats

The authors acknowledge that some imperfections may persist in the benchmark despite quality refinement efforts. Evaluation results for open-source models might differ slightly from published tables due to ongoing data refinement.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 89 stars in the last 90 days
