thinking-in-space by vision-x-nyu

VQA benchmark for evaluating spatial reasoning in MLLMs

created 7 months ago
555 stars

Top 58.6% on sourcepulse

Project Summary

This repository provides VSI-Bench, a benchmark and evaluation framework for assessing the visual-spatial intelligence of Multimodal Large Language Models (MLLMs). It addresses the gap in understanding how MLLMs perceive, remember, and recall spatial information from video, offering a resource for researchers and developers in AI and robotics.

How It Works

VSI-Bench comprises over 5,000 question-answer pairs derived from 288 egocentric videos of real indoor 3D scenes. Questions fall into three task types: configurational, measurement estimation, and spatiotemporal. Multiple-choice answers are scored with accuracy, and numerical answers with Mean Relative Accuracy (MRA). The accompanying analysis probes whether MLLMs build implicit "cognitive maps" of the environments they observe.
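For illustration, the sketch below shows how Mean Relative Accuracy can be computed for a single numerical answer. It assumes the threshold sweep {0.50, 0.55, ..., 0.95} described in the paper; the function name and exact thresholds are assumptions for this sketch, not code copied from the repository.

```python
import numpy as np

def mean_relative_accuracy(pred: float, target: float,
                           thresholds=np.arange(0.50, 1.00, 0.05)) -> float:
    """Average a hit/miss test over a sweep of relative-error thresholds:
    stricter thresholds only credit predictions closer to the target."""
    rel_err = abs(pred - target) / abs(target)
    return float(np.mean([rel_err < (1.0 - theta) for theta in thresholds]))

# Example: predicting 4.2 m for a true room length of 5.0 m gives partial
# credit (relative error 0.16 passes 7 of the 10 thresholds).
print(mean_relative_accuracy(4.2, 5.0))  # 0.7
```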

Quick Start & Requirements

  • Installation: Requires Python 3.10, Git, and Conda. Install by cloning the repo, initializing submodules, and installing dependencies via pip install -e . and specific packages like deepspeed and s2wrapper.
  • Benchmark Access: Load the dataset with datasets.load_dataset("nyu-visionx/VSI-Bench") (see the snippet after this list).
  • Evaluation: Run bash evaluate_all_in-one.sh --model all --num_processes 8 --benchmark vsibench.
  • Resources: No specific hardware requirements like GPUs are mentioned for benchmark access or evaluation setup, but model inference will likely require significant computational resources.
  • Links: VSI-Bench on Hugging Face: https://huggingface.co/datasets/nyu-visionx/VSI-Bench
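For reference, here is a minimal sketch of loading and inspecting the benchmark with the Hugging Face datasets library; the split name and field names in the comments are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Download VSI-Bench from the Hugging Face Hub.
bench = load_dataset("nyu-visionx/VSI-Bench")

print(bench)               # available splits and their sizes
sample = bench["test"][0]  # "test" split assumed; check the dataset card
print(sample.keys())       # e.g. question, options, ground truth, video id
```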

Highlighted Details

  • Benchmarks 15 video-supporting MLLMs, including proprietary models such as Gemini-1.5 and GPT-4o and open-source models from the InternVL2, VILA, LongVILA, LongVA, LLaVA-OneVision, and LLaVA-NeXT-Video families.
  • Evaluation is performed in zero-shot settings with default prompts and greedy decoding for reproducibility (a minimal greedy-decoding sketch follows this list).
  • The benchmark is built upon the lmms-eval toolkit.
  • Paper accepted to CVPR 2025.
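As an illustration of the zero-shot, greedy-decoding protocol (not the repository's actual evaluation code), a Hugging Face transformers generation config with sampling disabled looks like this; model loading and prompt construction are omitted.

```python
from transformers import GenerationConfig

# Greedy decoding: no sampling and no beam search, so repeated runs
# produce identical outputs for the same model and prompt.
greedy = GenerationConfig(
    do_sample=False,    # disable temperature / top-p sampling
    num_beams=1,        # plain greedy search
    max_new_tokens=32,  # answers are short: a choice letter or a number
)

# Hypothetical usage: outputs = model.generate(**inputs, generation_config=greedy)
```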

Maintenance & Community

  • The project is associated with New York University, Yale University, and Stanford University.
  • Feedback is encouraged for benchmark imperfections.
  • Citation details are provided for the paper "Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces".

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying datasets (ScanNet, ScanNet++, ARKitScenes) may have their own licenses.

Limitations & Caveats

The authors acknowledge that some imperfections may persist in the benchmark despite quality refinement efforts. Evaluation results for open-source models might differ slightly from published tables due to ongoing data refinement.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 89 stars in the last 90 days
