VLM-3R by VITA-Group

Vision-language model for 3D reconstruction and spatial reasoning

Created 3 months ago
263 stars

Top 96.9% on SourcePulse

Project Summary

VLM-3R addresses the challenge of enabling vision-language models (VLMs) to understand and reason about 3D spatial environments from monocular video input. It targets researchers and developers working on embodied AI, robotics, and spatial computing who need to equip models with human-like visual-spatial intelligence. The primary benefit is the ability to perform deep spatial understanding and instruction-aligned 3D reconstruction from single-camera video streams without requiring depth sensors or pre-existing 3D maps.

How It Works

VLM-3R integrates a pre-trained VLM with a novel 3D reconstructive instruction tuning approach. It processes monocular video frames with a geometry encoder (CUT3R) to derive implicit 3D tokens that capture the scene's spatial structure. A key innovation is the Spatial-Visual-View Fusion technique, which combines these 3D geometric tokens, per-view camera tokens, and 2D appearance features. The fused representation is then fed into the VLM, allowing it to align real-world spatial context with language instructions for tasks such as spatial assistance and embodied reasoning. This design avoids the limitations of existing VLMs, which struggle to infer spatial context from monocular video, and of specialized 3D-LLMs, which depend on external 3D data.
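
To make the fusion step concrete, the sketch below shows one plausible way to combine the three token streams in PyTorch. The class name SpatialVisualViewFusion, the token dimensions, and the use of cross-attention with a residual connection are illustrative assumptions, not the repository's actual implementation.

```python
# Minimal sketch of a Spatial-Visual-View fusion block (illustrative only).
# Token dimensions, module names, and the cross-attention design are assumptions;
# the actual VLM-3R code may fuse the streams differently.
import torch
import torch.nn as nn


class SpatialVisualViewFusion(nn.Module):
    def __init__(self, dim_2d=1024, dim_3d=768, dim_cam=768, dim_llm=4096, num_heads=8):
        super().__init__()
        # Project each token stream into the LLM embedding space.
        self.proj_2d = nn.Linear(dim_2d, dim_llm)
        self.proj_3d = nn.Linear(dim_3d, dim_llm)
        self.proj_cam = nn.Linear(dim_cam, dim_llm)
        # Let 2D appearance tokens attend to the geometric and camera context.
        self.cross_attn = nn.MultiheadAttention(dim_llm, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_llm)

    def forward(self, feats_2d, tokens_3d, tokens_cam):
        # feats_2d:   (B, N_frames * N_patches, dim_2d)  2D appearance features
        # tokens_3d:  (B, N_3d, dim_3d)                   implicit 3D tokens from the geometry encoder
        # tokens_cam: (B, N_frames, dim_cam)              per-view camera tokens
        q = self.proj_2d(feats_2d)
        kv = torch.cat([self.proj_3d(tokens_3d), self.proj_cam(tokens_cam)], dim=1)
        fused, _ = self.cross_attn(q, kv, kv)
        # Residual connection keeps the original appearance information intact.
        return self.norm(q + fused)


if __name__ == "__main__":
    fusion = SpatialVisualViewFusion()
    f2d = torch.randn(1, 32 * 196, 1024)   # e.g. 32 frames x 196 patches
    f3d = torch.randn(1, 512, 768)
    cam = torch.randn(1, 32, 768)
    print(fusion(f2d, f3d, cam).shape)      # torch.Size([1, 6272, 4096])
```

The fused tokens would then be prepended or interleaved with the text embeddings consumed by the language model, which is what lets spatial context and instructions be reasoned over jointly.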

Quick Start & Requirements

  • Installation: Clone the repository, initialize submodules, create a conda environment (conda create -n vlm3r python=3.10), and install dependencies using pip install -e ".[train]". Specific versions of PyTorch (2.1.1), torchvision (0.16.1), and CUDA (12.1) are required. FlashAttention (v2.7.1.post1) and other libraries like decord and openai are also needed. The CUT3R submodule requires additional setup and dependency installation.
  • Prerequisites: Python 3.10, CUDA 12.1, PyTorch 2.1.1, and pinned versions of the other Python packages listed above. One or more GPUs are required to run the model; a quick version check is sketched after this list.
  • Setup Time: Environment setup and dependency installation can take approximately 15-30 minutes, depending on network speed and system configuration.
  • Resources: Running inference and training will require significant GPU memory and compute power.
  • Links: Code (GitHub), Dataset (HF), VSTI-Bench (HF)
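
As a pre-flight check before the editable install, the snippet below verifies the pinned PyTorch, torchvision, and CUDA versions listed above. It is not part of the repository and only reflects the versions stated here.

```python
# Sanity check for the stated prerequisites (PyTorch 2.1.1, torchvision 0.16.1,
# CUDA 12.1, at least one visible GPU). Not part of the VLM-3R repository.
import torch
import torchvision

print("torch:", torch.__version__)               # expected 2.1.1
print("torchvision:", torchvision.__version__)   # expected 0.16.1
print("built with CUDA:", torch.version.cuda)    # expected 12.1
print("GPU available:", torch.cuda.is_available())

assert torch.__version__.startswith("2.1.1"), "VLM-3R expects PyTorch 2.1.1"
assert torch.version.cuda == "12.1", "VLM-3R expects a CUDA 12.1 build"
```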

Highlighted Details

  • End-to-end monocular video 3D understanding without external sensors.
  • Instruction tuning with over 200K 3D reconstructive QA pairs.
  • Introduces VSTI-Bench, a benchmark for spatio-temporal reasoning in dynamic 3D environments.
  • Utilizes CUT3R for extracting implicit 3D geometric tokens.

Maintenance & Community

The project is led by authors from UT Austin, XMU, TAMU, UCR, and UNC, with Meta also contributing. Recent updates include the release of inference scripts, training/evaluation scripts, and datasets. The project acknowledges contributions from CUT3R, LLaVA-NeXT, and thinking-in-space.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Given its foundation on LLaVA-NeXT and its use of publicly hosted Hugging Face datasets, permissive terms are plausible, but users should verify the license and dataset terms before any commercial use.

Limitations & Caveats

The raw video data from datasets such as ScanNet is not provided and must be downloaded and processed separately by the user. The data generation code for the route plan task in VSI-Bench has not yet been released.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 18 stars in the last 30 days
