SD-VLM by cpystan

Vision-language model for spatial understanding and measurement

Created 7 months ago
502 stars

Top 61.8% on SourcePulse

View on GitHub
Project Summary

SD-VLM introduces a novel approach to spatial understanding by integrating depth information directly into Vision-Language Models (VLMs). Targeting researchers and developers working on scene comprehension, robotics, and augmented reality, it enables more accurate spatial reasoning and measurement from images, enhancing the capabilities of existing VLM architectures.

How It Works

SD-VLM builds upon the LLaVA-1.5 architecture, enhancing it with depth-encoded representations. The core innovation lies in fusing visual features with depth data, allowing the model to perceive and reason about the 3D structure of scenes. This depth-aware processing is achieved by pairing a CLIP vision tower with a monocular depth estimation model (Depth Anything V2), enabling more precise spatial measurements and understanding than standard VLMs.
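To make the fusion idea concrete, here is a minimal numpy sketch of injecting per-patch depth into vision-tower tokens before they reach the language model. All dimensions, the patch grid, and the additive-injection scheme are illustrative assumptions, not SD-VLM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 24x24 patch grid (as in CLIP ViT-L/14 at 336px
# resolution) with 1024-d visual features per patch.
num_patches, vis_dim = 24 * 24, 1024

visual_feats = rng.standard_normal((num_patches, vis_dim))  # from the vision tower
depth_map = rng.random((24, 24))                            # from a depth estimator

# Lift each patch's scalar depth into feature space via a (hypothetical)
# learned linear projection, here just a random matrix.
depth_proj = rng.standard_normal((1, vis_dim)) * 0.01
depth_feats = depth_map.reshape(num_patches, 1) @ depth_proj

# Fuse: add the depth embedding to each visual token. The fused tokens would
# then pass through the VLM's projector into the LLM as usual.
fused = visual_feats + depth_feats
print(fused.shape)  # (576, 1024)
```

The key point is that depth enters at the token level, so the language model can attend to geometry-aware features without any change to its own architecture.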

Quick Start & Requirements

Installation involves cloning the repository and setting up a Conda environment with Python 3.10. Key dependencies include PyTorch, Hugging Face Transformers, and specific pre-trained models: LLaVA-1.5-7B as the base, clip-vit-large-patch14-336 as the vision tower, and depth_anything_v2_vitl for depth estimation. Training can be performed efficiently using LoRA on 8 V100 GPUs.

  • Primary Install: pip install -e . (after setting up Conda environment and cloning repo)
  • Additional Training Packages: pip install -e ".[train]"
  • Prerequisites: Python 3.10, Conda, PyTorch, Hugging Face libraries, clip-vit-large-patch14-336, depth_anything_v2_vitl.
  • Links: Project Page, arXiv, Data, Model Zoo
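The setup steps above might look like the following. The repository URL (inferred from the author/name on this page) and the Conda environment name are assumptions, not taken verbatim from the README.

```shell
# Clone the repo (URL assumed from the cpystan/SD-VLM naming on this page)
git clone https://github.com/cpystan/SD-VLM.git
cd SD-VLM

# Create and activate a Python 3.10 Conda environment
conda create -n sdvlm python=3.10 -y
conda activate sdvlm

# Install the package, then the extra training dependencies
pip install -e .
pip install -e ".[train]"
```

The pre-trained weights (LLaVA-1.5-7B, clip-vit-large-patch14-336, depth_anything_v2_vitl) would additionally need to be downloaded per the repository's instructions.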

Highlighted Details

  • Accepted to NeurIPS 2025, indicating significant research contribution.
  • Introduces the MSMU (Massive Spatial Measuring and Understanding) dataset for instruction tuning and benchmarking.
  • Supports efficient LoRA finetuning, reducing computational resource requirements for training.
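To illustrate why the LoRA finetuning mentioned above cuts resource requirements, here is a small numpy sketch of the core low-rank update, W_eff = W + BA. The hidden size and rank are illustrative, not SD-VLM's actual training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                        # hypothetical hidden size and LoRA rank
W = rng.standard_normal((d, d))       # frozen pretrained weight

# LoRA trains only two small factors: B (d x r) and A (r x d).
# B starts at zero, so training begins exactly at the pretrained W.
B = np.zeros((d, r))
A = rng.standard_normal((r, d)) * 0.01

W_eff = W + B @ A                     # effective weight used in the forward pass

full_params = d * d                   # parameters updated by full finetuning
lora_params = 2 * d * r               # parameters updated by LoRA
print(f"trainable: {lora_params} vs full: {full_params} "
      f"({100 * lora_params / full_params:.2f}%)")
```

Because only the two small factors receive gradients, optimizer state and gradient memory shrink by the same ratio, which is what makes training feasible on older hardware such as V100s.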

Maintenance & Community

The project is associated with the NeurIPS 2025 publication. Specific details regarding active maintenance, community channels (like Discord/Slack), or a public roadmap are not provided in the README.

Licensing & Compatibility

The README does not explicitly state the software license. Compatibility for commercial use or linking with closed-source projects cannot be determined from the provided information.

Limitations & Caveats

The README does not detail known limitations, alpha status, or specific unsupported platforms. The evaluation process requires an API key for GPT-4-Turbo, which may be a consideration for users without access.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 438 stars in the last 30 days
