Boosting MLLM spatial intelligence with video input
Spatial-MLLM enhances multimodal large language models (MLLMs) for visual-based spatial intelligence tasks, enabling better understanding and reasoning about scenes from video input. It targets researchers and developers working on advanced AI for spatial understanding, offering state-of-the-art performance on benchmarks like VSI-Bench.
How It Works
The architecture integrates a 2D visual encoder, a spatial encoder initialized from a visual geometry foundation model, a connector, and an LLM backbone. This design allows for explicit spatial reasoning by leveraging geometric priors. During inference, a space-aware frame sampling strategy is employed to optimize frame selection under GPU memory constraints, prioritizing spatially informative frames.
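The summary above does not say how frames are scored, so the following is only a minimal sketch of one plausible reading: treat frame selection as a coverage problem and greedily keep the frames that reveal the most not-yet-seen parts of the scene, up to a frame budget set by GPU memory. The function name select_frames, the frame_cells input (a hypothetical mapping from frame index to the set of scene cells visible in that frame), and the tie-breaking rule are all assumptions for illustration, not the project's released code.

```python
from typing import Dict, List, Set


def select_frames(frame_cells: Dict[int, Set[int]], budget: int) -> List[int]:
    """Greedily pick up to `budget` frames that cover the most scene cells.

    frame_cells is a hypothetical input: frame index -> ids of the scene
    cells (for example, voxels) that the frame observes.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = dict(frame_cells)

    while remaining and len(chosen) < budget:
        # Pick the frame that adds the most uncovered cells;
        # ties go to the earliest frame index.
        best = max(remaining, key=lambda f: (len(remaining[f] - covered), -f))
        if not remaining[best] - covered:
            break  # no remaining frame adds new spatial information
        chosen.append(best)
        covered |= remaining.pop(best)

    # Return frames in temporal order so the video stays coherent.
    return sorted(chosen)


if __name__ == "__main__":
    frames = {0: {1, 2, 3}, 1: {2, 3}, 2: {4, 5}, 3: {3, 4}, 4: {6}}
    print(select_frames(frames, budget=3))
```

Greedy selection is a common heuristic for coverage-style objectives; a smaller budget simply stops the loop earlier, which is how a memory limit would translate into fewer selected frames in this sketch.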
Quick Start & Requirements
Install PyTorch with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 (adjust the CUDA version as needed), then install transformers, accelerate, qwen_vl_utils, decord, ray, Levenshtein, tyro, and flash-attn.
Run inference with python scripts/inference.py. This requires about 13 GB of VRAM at bfloat16 precision.
Run the VSI-Bench evaluation with python evaluate/eval_vsibench.py or bash scripts/evaluate_vsibench.sh.
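Before running the inference or evaluation scripts, it can help to confirm that the listed dependencies are importable and that enough GPU memory is free. The sketch below is a hypothetical pre-flight check, not part of the repository: the helper names and the 13 GB threshold constant are assumptions derived from the requirements listed above, and the module names are the usual import names for those packages (flash-attn is imported as flash_attn).

```python
import importlib.util
from typing import Iterable, List, Optional

# Import names for the packages listed in the requirements.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers", "accelerate",
    "qwen_vl_utils", "decord", "ray", "Levenshtein", "tyro", "flash_attn",
]

# Approximate VRAM needed for bfloat16 inference, per the summary above.
MIN_FREE_VRAM_GB = 13.0


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def free_vram_gb() -> Optional[float]:
    """Return free memory on the current CUDA device in GB, or None if unavailable."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    vram = free_vram_gb()
    if vram is None:
        print("No CUDA device detected.")
    elif vram < MIN_FREE_VRAM_GB:
        print(f"Only {vram:.1f} GB free; about {MIN_FREE_VRAM_GB:.0f} GB is recommended.")
```

The check only reports problems and never installs anything, so it is safe to run in any environment.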
Highlighted Details
The current release includes the Spatial-MLLM-subset-sft model and VSI-Bench evaluation code.
Maintenance & Community
The project is associated with Tsinghua University. Key components are inspired by repositories such as thinking-in-space, VGGT, Qwen2.5-VL, open-r1, and R1-V. The roadmap includes releasing the full model, the space-aware frame sampling code, the training code, and the Spatial-MLLM-120k dataset.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The full Spatial-MLLM model, training code, and evaluation code for ScanQA and SQA3D are still under development and planned for future release. The current release focuses on a subset model and VSI-Bench evaluation.