Spatial-MLLM by diankun-wu

Boosting MLLM spatial intelligence with video input

Created 3 months ago
348 stars

Top 79.8% on SourcePulse

View on GitHub
Project Summary

Spatial-MLLM enhances multimodal large language models (MLLMs) for visual-based spatial intelligence tasks, enabling better understanding and reasoning about scenes from video input. It targets researchers and developers working on advanced AI for spatial understanding, offering state-of-the-art performance on benchmarks like VSI-Bench.

How It Works

The architecture integrates a 2D visual encoder, a spatial encoder initialized from a visual geometry foundation model, a connector, and an LLM backbone. This design allows for explicit spatial reasoning by leveraging geometric priors. During inference, a space-aware frame sampling strategy is employed to optimize frame selection under GPU memory constraints, prioritizing spatially informative frames.
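
To make the dual-branch design concrete, the sketch below shows one way tokens from the two encoders could be projected and fused before reaching the LLM backbone. All module names and dimensions are illustrative assumptions, not the repository's actual classes; the real connector lives in the Spatial-MLLM codebase.

```python
# Minimal sketch of the dual-encoder fusion described above. Module names and
# feature dimensions are illustrative assumptions, not the repository's classes.
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    """Projects 2D visual tokens and spatial (geometry) tokens into the LLM
    embedding space and fuses them token-wise."""

    def __init__(self, visual_dim=1152, spatial_dim=1024, llm_dim=3584):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, llm_dim)    # from the 2D visual encoder
        self.spatial_proj = nn.Linear(spatial_dim, llm_dim)  # from the VGGT-initialized spatial encoder
        self.fuse = nn.Linear(2 * llm_dim, llm_dim)          # simple concatenation + projection

    def forward(self, visual_tokens, spatial_tokens):
        # visual_tokens:  (batch, tokens, visual_dim)
        # spatial_tokens: (batch, tokens, spatial_dim)
        v = self.visual_proj(visual_tokens)
        s = self.spatial_proj(spatial_tokens)
        return self.fuse(torch.cat([v, s], dim=-1))  # fused tokens passed to the LLM backbone

# Toy shapes for one sampled clip of 256 tokens (illustrative only)
fused = DualEncoderConnector()(torch.randn(1, 256, 1152), torch.randn(1, 256, 1024))
print(fused.shape)  # torch.Size([1, 256, 3584])
```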

Quick Start & Requirements

  • Install: Clone the repository and create a conda environment with Python 3.10. Install PyTorch with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124 (adjust the CUDA version as needed), then the remaining dependencies: transformers, accelerate, qwen_vl_utils, decord, ray, Levenshtein, tyro, and flash-attn.
  • Inference: Run python scripts/inference.py; see the sketch after this list. Requires ~13GB of VRAM at bfloat16 precision.
  • Evaluation: Download the VSI-Bench dataset, extract it, and run python evaluate/eval_vsibench.py or bash scripts/evaluate_vsibench.sh.
  • Links: Project Page, arXiv Paper
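
The supported entry point is scripts/inference.py. For orientation only, the sketch below shows the Qwen2.5-VL-style call pattern the model builds on (AutoProcessor, qwen_vl_utils.process_vision_info, generate). The checkpoint path and question are placeholders, and the released checkpoint uses the repository's own model class rather than the stock Qwen2.5-VL class shown here, so treat this as an illustration of the interface, not a drop-in replacement for the script.

```python
# Illustrative only: the Qwen2.5-VL-style interface Spatial-MLLM builds on.
# The actual model class and checkpoint loading are handled by scripts/inference.py;
# "path/to/Spatial-MLLM-subset-sft" is a placeholder.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/Spatial-MLLM-subset-sft"  # placeholder checkpoint path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "file:///path/to/scene.mp4"},
    {"type": "text", "text": "How many chairs are in the room?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```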

Highlighted Details

  • Achieves SOTA performance on visual-based spatial reasoning tasks.
  • Utilizes a novel spatial encoder initialized from a visual geometry foundation model.
  • Employs a space-aware frame sampling strategy for efficient inference (sketched after this list).
  • Released Spatial-MLLM-subset-sft model and VSI-Bench evaluation code.
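
The summary does not spell out the sampling criterion, so the following is only a hedged illustration of the general idea: choose a memory-bounded subset of frames that greedily maximizes spatial coverage, where each frame's visible-voxel set is assumed to come from a geometry model such as VGGT. Function and variable names are hypothetical.

```python
def space_aware_sample(frame_voxels, frame_budget):
    """Greedy maximum-coverage frame selection (illustrative, not the repo's algorithm).

    frame_voxels: list of sets; frame_voxels[i] holds the voxel indices assumed
                  visible in frame i (e.g. from a VGGT-style point map, voxelized).
    frame_budget: number of frames the GPU memory budget allows.
    """
    selected, covered = [], set()
    for _ in range(min(frame_budget, len(frame_voxels))):
        # Pick the frame that adds the most not-yet-covered voxels.
        best = max(range(len(frame_voxels)), key=lambda i: len(frame_voxels[i] - covered))
        if len(frame_voxels[best] - covered) == 0:  # nothing new to cover; stop early
            break
        selected.append(best)
        covered |= frame_voxels[best]
    return sorted(selected)

# Toy example: 4 candidate frames, budget of 2 -> frames 2 and 0 cover the most space
frames = [{1, 2, 3}, {3, 4}, {5, 6, 7, 8}, {2, 3}]
print(space_aware_sample(frames, 2))  # [0, 2]
```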

Maintenance & Community

The project is associated with Tsinghua University. Key components are inspired by repositories like thinking-in-space, VGGT, Qwen2.5-VL, open-r1, and R1-V. A roadmap includes releasing the full model, space-aware frame sampling code, training code, and the Spatial-MLLM-120k dataset.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The full Spatial-MLLM model, training code, and evaluation code for ScanQA and SQA3D are still under development and planned for future release. The current release focuses on a subset model and VSI-Bench evaluation.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 32 stars in the last 30 days
