SpatialBot by BAAI-DCAI

Vision-language model for precise spatial understanding, detailed in a research paper

created 1 year ago
285 stars

Top 92.8% on sourcepulse

View on GitHub
Project Summary

SpatialBot is a Vision-Language Model (VLM) designed for precise spatial understanding, enabling robots to interpret and interact with their environment using RGB and depth information. It targets researchers and developers in robotics and AI, offering enhanced spatial reasoning capabilities for tasks like pick-and-place operations.

How It Works

SpatialBot builds upon the Bunny multimodal model architecture, integrating vision encoders (CLIP, SigLIP, EVA-CLIP) with various large language models (LLMs) including Phi-2, Phi-3, Qwen-1.5, and Llama-3. It processes both RGB and depth images, allowing for a richer understanding of scene geometry and object locations. The model can be fine-tuned for specific embodiment tasks, learning to predict positional deltas or key points for robotic control.
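
To make the embodiment setup concrete, the sketch below shows a minimal control-loop wrapper: the model is queried with the current RGB and depth frames, and its text reply is parsed into an end-effector position delta. This is not SpatialBot's actual API; `ask_spatialbot` and the "dx dy dz" reply format are illustrative assumptions standing in for a fine-tuned checkpoint.

```python
# Illustrative control-loop sketch (NOT the repo's API): a SpatialBot-style
# policy is queried once per step with the current RGB and depth frames and
# returns a small gripper displacement as text, which the caller parses.
import re
from typing import Callable, Sequence, Tuple

def parse_delta(answer: str) -> Tuple[float, float, float]:
    """Extract three signed numbers (dx, dy, dz) from the model's text answer.
    The reply format is an assumption; adapt it to the checkpoint you use."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer)
    if len(nums) < 3:
        raise ValueError(f"Expected three numbers in answer: {answer!r}")
    dx, dy, dz = (float(n) for n in nums[:3])
    return dx, dy, dz

def control_step(
    ask_spatialbot: Callable[[Sequence, str], str],  # (images, prompt) -> text
    rgb,          # current RGB frame (e.g., a PIL.Image)
    depth,        # aligned depth frame
    instruction: str,
) -> Tuple[float, float, float]:
    """Query the VLM once and return the position delta it predicts."""
    prompt = f"{instruction} Reply with the gripper displacement as 'dx dy dz' in meters."
    return parse_delta(ask_spatialbot([rgb, depth], prompt))
```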

Quick Start & Requirements

  • Install: pip install torch transformers accelerate pillow numpy
  • Prerequisites: Python, PyTorch, Transformers library. GPU recommended for optimal performance.
  • Model: Download SpatialBot-3B from Hugging Face.
  • Demo: see the repository's Python snippet for direct inference (a condensed, hedged sketch follows this list).
  • Docker: a base image is available as russellrobin/bunny:latest, with upgrade instructions in the README.
  • Resources: Requires downloading model weights (e.g., SpatialBot-3B).
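
The sketch below condenses direct inference, assuming the Bunny-style trust_remote_code interface documented on the SpatialBot-3B Hugging Face model card; the repo ID, prompt template, depth-packing formula, and image-placeholder token IDs are assumptions modeled on that family of quickstarts and should be verified against the actual model card.

```python
# Minimal direct-inference sketch (assumptions noted in the comments).
import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RussRobin/SpatialBot-3B"  # assumed Hugging Face repo ID
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

rgb = Image.open("rgb.jpg").convert("RGB")
depth = Image.open("depth.png")  # aligned depth map

# If the depth map is single-channel, spread it over three channels so the
# RGB-shaped vision encoder can ingest it. This packing is an approximation
# of the scheme on the model card; confirm the exact formula there.
if len(depth.getbands()) == 1:
    d = np.asarray(depth).astype(np.uint32)
    packed = np.zeros((*d.shape, 3), dtype=np.uint8)
    packed[:, :, 0] = np.clip(d // 1024, 0, 63) * 4
    packed[:, :, 1] = ((d // 32) % 32) * 8
    packed[:, :, 2] = (d % 32) * 8
    depth = Image.fromarray(packed, "RGB")

prompt = "Which object is closest to the camera, and how far away is it?"
template = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    f"questions. USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"
)
# -201 / -202 are assumed to be the RGB / depth image-placeholder token IDs
# used by the Bunny/SpatialBot remote code; confirm them on the model card.
chunks = [tokenizer(c).input_ids for c in template.split("<image 1>\n<image 2>\n")]
offset = 1 if tokenizer.bos_token_id is not None and chunks[1][:1] == [tokenizer.bos_token_id] else 0
input_ids = torch.tensor(
    chunks[0] + [-201, -202] + chunks[1][offset:], dtype=torch.long
).unsqueeze(0).to(device)

# process_images is expected to come from the model's remote code and handle
# resizing/normalization of both images.
image_tensor = model.process_images([rgb, depth], model.config).to(dtype=model.dtype, device=device)

output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=100, use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```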

Highlighted Details

  • Supports multi-image and RGB-D inputs for enhanced spatial context.
  • Fine-tuned on SpatialQA and SpatialQA-E datasets for VQA and robotic embodiment tasks (a hypothetical record is sketched after this list).
  • Offers pre-trained models and scripts for pre-training and LoRA fine-tuning.
  • Includes SpatialBench for evaluation of spatial understanding capabilities.
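
For orientation, here is a hypothetical SpatialQA-style training record written in the LLaVA-style conversation schema that Bunny-derived training scripts commonly consume; the field names, multi-image convention, and answer text are illustrative assumptions, not an excerpt from the released dataset.

```python
# Hypothetical SpatialQA-style record (assumed schema, not a verbatim sample).
sample = {
    "id": "spatialqa_000001",
    "image": ["scene_0001_rgb.jpg", "scene_0001_depth.png"],  # RGB + depth pair
    "conversations": [
        {
            "from": "human",
            "value": "<image 1>\n<image 2>\nWhich object is closer to the camera, the mug or the plate?",
        },
        {
            "from": "gpt",
            "value": "The mug is closer: it is roughly 0.8 m away, versus about 1.3 m for the plate.",
        },
    ],
}
```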

Maintenance & Community

The project is associated with multiple institutions including SJTU, Stanford, BAAI, PKU, and Oxford. It is built upon the Bunny model. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

  • Model Checkpoints: Restricted by the licenses of the underlying models (Bunny, Llama-3, Phi-2, Phi-3, Qwen-1.5, GPT-4).
  • Datasets: SpatialQA is CC-BY-4.0. SpatialQA-E dataset availability is pending.
  • Commercial Use: Compatibility depends on the specific LLM licenses used.

Limitations & Caveats

The project accompanies an ICRA 2025 paper, suggesting it may still be in a research-heavy development phase. The SpatialQA-E dataset and the embodied SpatialBot checkpoints are noted as "coming soon."

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

38 stars in the last 90 days

