A vision-language model for precise spatial understanding, detailed in an accompanying research paper.
SpatialBot is a Vision-Language Model (VLM) designed for precise spatial understanding, enabling robots to interpret and interact with their environment using RGB and depth information. It targets researchers and developers in robotics and AI, offering enhanced spatial reasoning capabilities for tasks like pick-and-place operations.
How It Works
SpatialBot builds upon the Bunny multimodal model architecture, integrating vision encoders (CLIP, SigLIP, EVA-CLIP) with various large language models (LLMs) including Phi-2, Phi-3, Qwen-1.5, and Llama-3. It processes both RGB and depth images, allowing for a richer understanding of scene geometry and object locations. The model can be fine-tuned for specific embodiment tasks, learning to predict positional deltas or key points for robotic control.
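The sketch below shows, under stated assumptions, how an RGB frame and an aligned depth map might be passed to a SpatialBot checkpoint through the Hugging Face transformers API. The model id RussRobin/SpatialBot-3B, the chat template, the image-placeholder token ids, and the process_images / generate(images=...) helpers follow the Bunny-style remote-code interface and should be verified against the official README; this is a sketch, not the definitive usage.

```python
# Hedged inference sketch. Assumptions: the checkpoint id, the chat template,
# the placeholder token ids, and the process_images / generate(images=...)
# helpers exposed via trust_remote_code; verify all of these against the README.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RussRobin/SpatialBot-3B"  # assumed Hugging Face checkpoint id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # loads the Bunny-style multimodal wrapper
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

prompt = "What is the depth value at the center of the mug on the table?"
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    f"USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"
)

# Splice placeholder ids for the RGB and depth images into the token stream
# (LLaVA/Bunny-style negative indices; the exact ids are an assumption).
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image 1>\n<image 2>\n")]
input_ids = torch.tensor(chunks[0] + [-201, -202] + chunks[1], dtype=torch.long)
input_ids = input_ids.unsqueeze(0).to(model.device)

rgb = Image.open("scene_rgb.jpg")      # color frame
depth = Image.open("scene_depth.png")  # aligned depth map, encoded per the README

# Pack both images for the shared vision encoder (assumed remote-code helper).
image_tensor = model.process_images([rgb, depth], model.config).to(
    dtype=model.dtype, device=model.device
)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=64,
    use_cache=True,
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

For embodiment fine-tunes, the same interface would return positional deltas or key points as text for the robot controller to parse, per the description above.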
Quick Start & Requirements
Install the core Python dependencies:
pip install torch transformers accelerate pillow numpy
The README also points to a Docker image, russellrobin/bunny:latest, with instructions to upgrade it as needed.
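Depth sensors typically emit single-channel 16-bit maps, while the vision encoders listed above expect three-channel input, so a common preparation step is to pack the depth map into a three-channel image. The sketch below uses illustrative file names and a generic min-max scaling; the exact depth encoding SpatialBot was trained with is documented in the repository and should be used where it differs.

```python
# Hedged sketch: load an aligned RGB-D pair and pack a 16-bit depth map into a
# three-channel uint8 image so an RGB vision encoder can consume it. File names
# and the min-max scaling are illustrative, not SpatialBot's exact encoding.
import numpy as np
from PIL import Image

rgb = Image.open("scene_rgb.jpg").convert("RGB")
depth_raw = np.array(Image.open("scene_depth.png"))  # e.g. uint16 millimeters

# Min-max normalize to 8 bits and replicate across three channels.
d = depth_raw.astype(np.float32)
d = (d - d.min()) / max(float(d.max() - d.min()), 1e-6)
depth_3ch = np.repeat((d * 255).astype(np.uint8)[..., None], 3, axis=2)
depth_img = Image.fromarray(depth_3ch, mode="RGB")

print(rgb.size, depth_img.size)  # the two images should share the same resolution
```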
Highlighted Details
Maintenance & Community
The project is associated with multiple institutions including SJTU, Stanford, BAAI, PKU, and Oxford. It is built upon the Bunny model. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The project is associated with an ICRA 2025 paper, suggesting it is still in a research-heavy development phase. The SpatialQA-E dataset and the embodied SpatialBot checkpoints are noted as "coming soon."