VQA benchmark for evaluating spatial reasoning in MLLMs
This repository provides VSI-Bench, a benchmark and evaluation framework for assessing the visual-spatial intelligence of Multimodal Large Language Models (MLLMs). It addresses the gap in understanding how MLLMs perceive, remember, and recall spatial information from video, offering a resource for researchers and developers in AI and robotics.
How It Works
VSI-Bench comprises over 5,000 question-answer pairs derived from 288 egocentric videos of indoor 3D scenes. It covers three task types: configurational, measurement estimation, and spatiotemporal, scored with accuracy for multiple-choice answers and Mean Relative Accuracy (MRA) for numerical answers. The benchmark also probes whether MLLMs build implicit "cognitive maps" of the environments they watch.
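For the numerical tasks, Mean Relative Accuracy averages a pass/fail relative-error check across a sweep of thresholds, so near-misses earn partial credit instead of scoring zero. A minimal Python sketch, assuming the θ ∈ {0.50, 0.55, …, 0.95} threshold grid described in the accompanying paper (the function name is illustrative, not the repository's implementation):

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Average of 1(|pred - target| / |target| < 1 - theta) over
    theta = 0.50, 0.55, ..., 0.95. Assumes target != 0."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    rel_err = abs(pred - target) / abs(target)
    return sum(rel_err < 1.0 - t for t in thresholds) / len(thresholds)

# Example: a prediction within 5% of the ground truth passes
# 9 of the 10 thresholds and scores 0.9.
print(mean_relative_accuracy(3.8, 4.0))  # 0.9
```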
Quick Start & Requirements
Install the framework in editable mode, along with specific packages such as `deepspeed` and `s2wrapper`:

```bash
pip install -e .
```
Load the benchmark from Hugging Face:

```python
from datasets import load_dataset

vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
```
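Since this summary doesn't document the dataset schema, a safe way to explore it is to print the splits and features rather than assuming column names. A minimal sketch using the standard `datasets` API:

```python
from datasets import load_dataset

vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
print(vsi_bench)                        # splits and their row counts
split = next(iter(vsi_bench.values()))  # take the first split
print(split.features)                   # column names and feature types
```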
Run the full evaluation:

```bash
bash evaluate_all_in-one.sh --model all --num_processes 8 --benchmark vsibench
```
Highlighted Details
Evaluation is built on the `lmms-eval` toolkit.
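For invoking the toolkit directly instead of the all-in-one script, a sketch of a typical `lmms-eval` command line follows; the model name and the `vsibench` task label below are assumptions based on the toolkit's usual CLI, not this repository's documented entry point:

```bash
# Hypothetical direct invocation via lmms-eval's standard launcher;
# swap in the model and task names your setup actually registers.
accelerate launch --num_processes 8 -m lmms_eval \
    --model llava \
    --tasks vsibench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```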
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The authors acknowledge that some imperfections may persist in the benchmark despite quality refinement efforts. Evaluation results for open-source models might differ slightly from published tables due to ongoing data refinement.