vision-x-nyu: VQA benchmark for evaluating spatial reasoning in MLLMs
This repository provides VSI-Bench, a benchmark and evaluation framework for assessing the visual-spatial intelligence of Multimodal Large Language Models (MLLMs). It addresses the gap in understanding how MLLMs perceive, remember, and recall spatial information from video, offering a resource for researchers and developers in AI and robotics.
How It Works
VSI-Bench comprises over 5,000 question-answer pairs derived from 288 egocentric videos of indoor 3D scenes. It covers three task types: configurational, measurement estimation, and spatiotemporal. Multiple-choice questions are scored by accuracy, while numerical answers use Mean Relative Accuracy (MRA), sketched below. The benchmark aims to probe whether MLLMs build implicit "cognitive maps" of the environments they observe.
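For the numerical questions, MRA rewards predictions whose relative error stays small. The sketch below assumes the formulation described in the VSI-Bench paper, averaging a pass/fail test over confidence thresholds in {0.50, 0.55, ..., 0.95}; treat the exact threshold set as an assumption to verify against the repository.

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Average, over confidence thresholds theta, of whether the
    relative error |pred - target| / |target| stays below 1 - theta.
    Threshold set assumed to be {0.50, 0.55, ..., 0.95}."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]  # 0.50 .. 0.95
    rel_error = abs(pred - target) / abs(target)
    return sum(rel_error < 1 - t for t in thresholds) / len(thresholds)

# Example: predicting 4.2 m for a true distance of 4.0 m.
print(mean_relative_accuracy(4.2, 4.0))  # ~0.9: a 5% error passes 9 of 10 thresholds
```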
Quick Start & Requirements
Install with pip install -e . plus specific packages such as deepspeed and s2wrapper. Load the data with datasets.load_dataset("nyu-visionx/VSI-Bench"). Run the full evaluation with bash evaluate_all_in-one.sh --model all --num_processes 8 --benchmark vsibench.
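Once the package is installed, the dataset itself can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch using the dataset ID from the command above; the split name is a guess, so it falls back to whatever split exists.

```python
from datasets import load_dataset

# Dataset ID taken from the quick-start command above.
ds = load_dataset("nyu-visionx/VSI-Bench")

# Split and field names are illustrative; inspect ds to see the
# actual schema before relying on any particular key.
split = ds["test"] if "test" in ds else next(iter(ds.values()))
print(len(split), "question-answer pairs")
print(split[0])  # one record linked to a source video
```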
Evaluation is built on the lmms-eval toolkit.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The authors acknowledge that some imperfections may persist in the benchmark despite quality refinement efforts. Evaluation results for open-source models might differ slightly from published tables due to ongoing data refinement.