SD-VLM (cpystan): Vision-language model for spatial understanding and measurement
Top 61.8% on SourcePulse
SD-VLM introduces a novel approach to spatial understanding by integrating depth information directly into Vision-Language Models (VLMs). Targeting researchers and developers working on scene comprehension, robotics, and augmented reality, it enables more accurate spatial reasoning and measurement from images, enhancing the capabilities of existing VLM architectures.
How It Works
SD-VLM builds upon the LLaVA-1.5 architecture, enhancing it with depth-encoded representations. The core innovation lies in fusing visual features with depth data, allowing the model to perceive and reason about the 3D structure of scenes. This depth-aware processing is achieved through specific vision towers and depth estimation models, enabling more precise spatial measurements and understanding compared to standard VLMs.
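The fusion step described above can be sketched as follows. This is a minimal, illustrative example, not SD-VLM's actual implementation: the class name, dimensions, and the concatenate-then-project strategy are assumptions chosen to show how per-patch visual features and depth features might be combined into LLM-width tokens.

```python
import torch
import torch.nn as nn

class DepthFusion(nn.Module):
    """Hedged sketch of depth-aware fusion: concatenate per-patch visual
    features with depth-encoder features, then project to the LLM width.
    Shapes and names are illustrative, not taken from the SD-VLM code."""
    def __init__(self, vis_dim: int = 1024, depth_dim: int = 1024, out_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim + depth_dim, out_dim)

    def forward(self, vis_feats: torch.Tensor, depth_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, num_patches, dim) with aligned patch grids
        fused = torch.cat([vis_feats, depth_feats], dim=-1)
        return self.proj(fused)

fusion = DepthFusion()
vis = torch.randn(1, 576, 1024)    # e.g. 24x24 patches from a ViT-L/336 vision tower
depth = torch.randn(1, 576, 1024)  # depth features assumed aligned to the same patches
tokens = fusion(vis, depth)
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The key design point is that depth is injected before the language model sees the tokens, so spatial cues are available to every downstream attention layer.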
Quick Start & Requirements
Installation involves cloning the repository and setting up a Conda environment with Python 3.10. Key dependencies include PyTorch, Hugging Face Transformers, and specific pre-trained models: LLaVA-1.5-7B as the base, clip-vit-large-patch14-336 as the vision tower, and depth_anything_v2_vitl for depth estimation. Training can be performed efficiently using LoRA on 8 V100 GPUs.
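To illustrate why LoRA makes fine-tuning feasible on V100-class GPUs, here is a minimal LoRA adapter in plain PyTorch. It is a sketch of the general technique only; SD-VLM's training presumably uses the LLaVA/PEFT tooling, and the rank and scaling values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: the base weight is frozen and only a low-rank
    update (B @ A) is trained. Illustrative sketch, not SD-VLM's code."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen base weights save optimizer memory
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096))
x = torch.randn(2, 4096)
y = layer(x)
print(y.shape)  # torch.Size([2, 4096])
```

Because `B` is zero-initialized, the adapted layer initially reproduces the frozen base layer exactly, and only the small `A`/`B` matrices receive gradients.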
pip install -e .  (after cloning the repository and setting up the Conda environment)
pip install -e ".[train]"

Required pre-trained models: clip-vit-large-patch14-336, depth_anything_v2_vitl.

Highlighted Details
Maintenance & Community
The project is associated with the NeurIPS 2025 publication. Specific details regarding active maintenance, community channels (like Discord/Slack), or a public roadmap are not provided in the README.
Licensing & Compatibility
The README does not explicitly state the software license. Compatibility for commercial use or linking with closed-source projects cannot be determined from the provided information.
Limitations & Caveats
The README does not detail known limitations, alpha status, or specific unsupported platforms. The evaluation process requires an API key for GPT-4-Turbo, which may be a consideration for users without access.
Last updated 3 months ago · Inactive