Vision-language model for 3D reconstruction and spatial reasoning
Top 96.9% on SourcePulse
VLM-3R addresses the challenge of enabling vision-language models (VLMs) to understand and reason about 3D spatial environments from monocular video input. It targets researchers and developers working on embodied AI, robotics, and spatial computing who need to equip models with human-like visual-spatial intelligence. The primary benefit is the ability to perform deep spatial understanding and instruction-aligned 3D reconstruction from single-camera video streams without requiring depth sensors or pre-existing 3D maps.
How It Works
VLM-3R integrates a pre-trained VLM with a novel 3D reconstructive instruction tuning approach. It processes monocular video frames using a geometry encoder (CUT3R) to derive implicit 3D tokens representing spatial understanding. A key innovation is the Spatial-Visual-View Fusion technique, which combines these 3D geometric tokens, per-view camera tokens, and 2D appearance features. This fused representation is then fed into the VLM, allowing it to align real-world spatial context with language instructions for tasks like spatial assistance and embodied reasoning. This approach avoids the limitations of existing VLMs, which struggle with spatial context from monocular video, and of specialized 3D-LLMs, which depend on external 3D data.
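The fusion step can be pictured as projecting each token stream (camera, geometry, appearance) into the language model's embedding space and concatenating them into a single multimodal sequence. The sketch below is a minimal illustration of that idea, not the VLM-3R implementation; the module name, token dimensions, and fusion-by-concatenation strategy are assumptions made for clarity.

```python
# Schematic sketch of the Spatial-Visual-View Fusion idea (illustrative only).
# Dimensions and the concatenation strategy are assumptions, not the real code.
import torch
import torch.nn as nn

class SpatialVisualViewFusion(nn.Module):
    """Fuses per-frame 2D appearance features with 3D geometry and camera tokens."""

    def __init__(self, dim_2d=1024, dim_3d=768, dim_cam=768, dim_llm=4096):
        super().__init__()
        # Separate projections map each token stream into the LLM embedding space.
        self.proj_2d = nn.Linear(dim_2d, dim_llm)
        self.proj_3d = nn.Linear(dim_3d, dim_llm)
        self.proj_cam = nn.Linear(dim_cam, dim_llm)

    def forward(self, feats_2d, tokens_3d, tokens_cam):
        # feats_2d:   (frames, patches, dim_2d)  2D appearance features per frame
        # tokens_3d:  (frames, n_geo, dim_3d)    implicit 3D tokens from the geometry encoder
        # tokens_cam: (frames, 1, dim_cam)       per-view camera tokens
        fused = torch.cat(
            [self.proj_cam(tokens_cam), self.proj_3d(tokens_3d), self.proj_2d(feats_2d)],
            dim=1,
        )  # (frames, 1 + n_geo + patches, dim_llm)
        # Flatten over frames into one multimodal token sequence for the VLM.
        return fused.reshape(1, -1, fused.shape[-1])

# Example: 8 video frames, 576 visual patches, 64 geometry tokens per frame.
fusion = SpatialVisualViewFusion()
seq = fusion(torch.randn(8, 576, 1024), torch.randn(8, 64, 768), torch.randn(8, 1, 768))
print(seq.shape)  # torch.Size([1, 5128, 4096])
```

In the actual model, a fused sequence like this would be consumed by the VLM alongside the tokenized language instruction, which is what lets spatial geometry and text reasoning share one context.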
Quick Start & Requirements
Setup involves creating a conda environment (conda create -n vlm3r python=3.10) and installing dependencies with pip install -e ".[train]". Specific versions of PyTorch (2.1.1), torchvision (0.16.1), and CUDA (12.1) are required, along with FlashAttention (v2.7.1.post1) and libraries such as decord and openai. The CUT3R submodule requires additional setup and dependency installation.
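Because the pinned versions matter (mismatched PyTorch and CUDA builds are a common cause of FlashAttention failures), a quick environment check can save debugging time. The snippet below is a generic sanity check, not part of the VLM-3R repository; the expected versions in the comments simply restate the requirements above.

```python
# Generic sanity check for the pinned requirements listed above (assumed helper,
# not from the VLM-3R repo): confirms PyTorch/torchvision/CUDA/flash-attn versions.
import torch
import torchvision

print("torch:", torch.__version__)              # expected 2.1.1
print("torchvision:", torchvision.__version__)  # expected 0.16.1
print("CUDA (build):", torch.version.cuda)      # expected 12.1
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.7.1.post1
except ImportError:
    print("flash-attn not installed; build it after installing PyTorch")
```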
Maintenance & Community
The project is led by authors from UT Austin, XMU, TAMU, UCR, and UNC, with Meta also contributing. Recent updates include the release of inference scripts, training/evaluation scripts, and datasets. The project acknowledges contributions from CUT3R, LLaVA-NeXT, and thinking-in-space.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Given its foundation on LLaVA-NeXT and its use of Hugging Face datasets, permissive licensing is plausible, but users should verify the actual terms before commercial use.
Limitations & Caveats
The raw video data from datasets like ScanNet is not provided and must be downloaded and processed separately by the user. The data generation code for the route-plan task in VSI-Bench has not yet been released.