diankun-wu
Boosting MLLM spatial intelligence with video input
Top 69.9% on SourcePulse
Spatial-MLLM enhances multimodal large language models (MLLMs) for visual-based spatial intelligence tasks, enabling better understanding and reasoning about scenes from video input. It targets researchers and developers working on advanced AI for spatial understanding, offering state-of-the-art performance on benchmarks like VSI-Bench.
How It Works
The architecture integrates a 2D visual encoder, a spatial encoder initialized from a visual geometry foundation model, a connector, and an LLM backbone. Geometric priors from the spatial encoder support explicit spatial reasoning. At inference time, a space-aware frame sampling strategy selects the most spatially informative frames that fit within the GPU memory budget.
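The text does not describe how the space-aware sampling is implemented, and the project lists that code as not yet released. As a rough illustration only, the sketch below assumes each frame has already been mapped to the set of scene voxels it observes (a hypothetical preprocessing step) and greedily picks the frames that add the most unseen voxels until a frame budget is reached. The function name, the voxel-set input, and the budget parameter are assumptions made for this example, not the project's API.

```python
from typing import List, Set


def select_frames_by_coverage(frame_voxels: List[Set[int]], budget: int) -> List[int]:
    """Greedy coverage heuristic (illustrative, not the project's code).

    frame_voxels[i] is the (hypothetical) set of voxel IDs visible in frame i.
    Returns up to `budget` frame indices in temporal order.
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = set(range(len(frame_voxels)))

    while remaining and len(chosen) < budget:
        # Pick the frame that reveals the most voxels not yet covered.
        # Iterating in sorted order makes ties resolve to the earliest frame.
        best = max(sorted(remaining), key=lambda i: len(frame_voxels[i] - covered))
        if not frame_voxels[best] - covered:
            break  # No remaining frame adds new coverage; stop early.
        chosen.append(best)
        covered |= frame_voxels[best]
        remaining.remove(best)

    return sorted(chosen)


# Toy example: four frames, keep at most two.
frames = [{1, 2, 3}, {3, 4}, {1, 2}, {5, 6, 7, 8}]
selected = select_frames_by_coverage(frames, budget=2)
print(selected)  # [0, 3]
```

A smaller budget trades scene coverage for memory, which is the kind of trade-off the sampling strategy is described as managing. The sketch can return fewer frames than the budget when the remaining frames add nothing new.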
Quick Start & Requirements
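A small preflight check can confirm that the dependencies named in this section are importable before running the project scripts. This is a minimal sketch using only the standard library. The import names are assumptions derived from the package names listed here (for example, flash-attn is assumed to import as flash_attn), and the helper name missing_modules is hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed from the packages listed in this section.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers", "accelerate",
    "qwen_vl_utils", "decord", "ray", "Levenshtein", "tyro", "flash_attn",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

The check only verifies that the modules can be located. It does not validate versions or CUDA compatibility.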
Install PyTorch with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 (adjust the CUDA version as needed), then install transformers, accelerate, qwen_vl_utils, decord, ray, Levenshtein, tyro, and flash-attn.
Run inference with python scripts/inference.py. This requires ~13GB VRAM using bfloat16 precision.
Run the VSI-Bench evaluation with python evaluate/eval_vsibench.py or bash scripts/evaluate_vsibench.sh.
Highlighted Details
The current release includes the Spatial-MLLM-subset-sft model and VSI-Bench evaluation code.
Maintenance & Community
The project is associated with Tsinghua University. Key components are inspired by repositories like thinking-in-space, VGGT, Qwen2.5-VL, open-r1, and R1-V. A roadmap includes releasing the full model, space-aware frame sampling code, training code, and the Spatial-MLLM-120k dataset.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The full Spatial-MLLM model, training code, and evaluation code for ScanQA and SQA3D are still under development and planned for future release. The current release focuses on a subset model and VSI-Bench evaluation.