A vision-language model for precise spatial understanding, detailed in an accompanying research paper.
SpatialBot is a Vision-Language Model (VLM) designed for precise spatial understanding, enabling robots to interpret and interact with their environment using RGB and depth information. It targets researchers and developers in robotics and AI, offering enhanced spatial reasoning capabilities for tasks like pick-and-place operations.
How It Works
SpatialBot builds upon the Bunny multimodal model architecture, integrating vision encoders (CLIP, SigLIP, EVA-CLIP) with various large language models (LLMs) including Phi-2, Phi-3, Qwen-1.5, and Llama-3. It processes both RGB and depth images, allowing for a richer understanding of scene geometry and object locations. The model can be fine-tuned for specific embodiment tasks, learning to predict positional deltas or key points for robotic control.
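The sketch below shows, under stated assumptions, how an RGB frame and an aligned depth map might be passed to a SpatialBot checkpoint through the Hugging Face transformers API. The model id RussRobin/SpatialBot-3B, the chat template, the image-placeholder token ids, and the process_images / generate(images=...) helpers follow the Bunny-style remote-code interface and should be verified against the official README; this is a sketch, not the definitive usage.

```python
# Hedged inference sketch. Assumptions: the checkpoint id, the chat template,
# the placeholder token ids, and the process_images / generate(images=...)
# helpers exposed via trust_remote_code; verify all of these against the README.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RussRobin/SpatialBot-3B"  # assumed Hugging Face checkpoint id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # loads the Bunny-style multimodal wrapper
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

prompt = "What is the depth value at the center of the mug on the table?"
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    f"USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"
)

# Splice placeholder ids for the RGB and depth images into the token stream
# (LLaVA/Bunny-style negative indices; the exact ids are an assumption).
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image 1>\n<image 2>\n")]
input_ids = torch.tensor(chunks[0] + [-201, -202] + chunks[1], dtype=torch.long)
input_ids = input_ids.unsqueeze(0).to(model.device)

rgb = Image.open("scene_rgb.jpg")      # color frame
depth = Image.open("scene_depth.png")  # aligned depth map, encoded per the README

# Pack both images for the shared vision encoder (assumed remote-code helper).
image_tensor = model.process_images([rgb, depth], model.config).to(
    dtype=model.dtype, device=model.device
)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=64,
    use_cache=True,
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

For embodiment fine-tunes, the same interface would return positional deltas or key points as text for the robot controller to parse, per the description above.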
Quick Start & Requirements
Install the core Python dependencies:
pip install torch transformers accelerate pillow numpy
The README also points to a Docker image, russellrobin/bunny:latest, with instructions to upgrade it as needed.
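Depth sensors typically emit single-channel 16-bit maps, while the vision encoders listed above expect three-channel input, so a common preparation step is to pack the depth map into a three-channel image. The sketch below uses illustrative file names and a generic min-max scaling; the exact depth encoding SpatialBot was trained with is documented in the repository and should be used where it differs.

```python
# Hedged sketch: load an aligned RGB-D pair and pack a 16-bit depth map into a
# three-channel uint8 image so an RGB vision encoder can consume it. File names
# and the min-max scaling are illustrative, not SpatialBot's exact encoding.
import numpy as np
from PIL import Image

rgb = Image.open("scene_rgb.jpg").convert("RGB")
depth_raw = np.array(Image.open("scene_depth.png"))  # e.g. uint16 millimeters

# Min-max normalize to 8 bits and replicate across three channels.
d = depth_raw.astype(np.float32)
d = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)
depth_3ch = np.repeat((d * 255).astype(np.uint8)[..., None], 3, axis=2)
depth_img = Image.fromarray(depth_3ch, mode="RGB")

print(rgb.size, depth_img.size)  # the two images should share the same resolution
```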
Highlighted Details
Maintenance & Community
The project is associated with multiple institutions including SJTU, Stanford, BAAI, PKU, and Oxford. It is built upon the Bunny model. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The project is associated with an ICRA 2025 paper, suggesting it is still in a research-heavy development phase. The SpatialQA-E dataset and the embodied SpatialBot checkpoints are noted as "coming soon."