SpatialVLA by SpatialVLA

Vision-language-action model for robot control, trained on real robot episodes

created 6 months ago
412 stars

Top 72.0% on sourcepulse

Project Summary

SpatialVLA is a spatial-enhanced vision-language-action model designed for real-world robotics tasks. It offers a concise, HuggingFace-based implementation trained on over 1.1 million real robot episodes, targeting researchers and developers in robotics and embodied AI who need efficient, performant models for robot control.

How It Works

SpatialVLA uses PaliGemma 2 as its backbone, integrating spatial representations to improve vision-language-action understanding. This allows for more nuanced interpretation of spatial relationships in robot environments, leading to better action prediction and control. The model is designed for efficient inference and easy deployment within the HuggingFace ecosystem.

Quick Start & Requirements

  • Install/Run: Load the model and run inference with HuggingFace Transformers (>= 4.47.0).
  • Prerequisites: Python >= 3.10, transformers >= 4.47.0, PyTorch. Requires ~8.5GB GPU memory for inference. For training/fine-tuning, additional dependencies are listed in requirements.txt, including a custom dlimp.
  • Setup: Basic inference setup is straightforward via HuggingFace. Training requires cloning the repo, setting up a Python 3.10 environment, and installing requirements.
  • Links: Paper, Project Page, Model Zoo.
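The inference flow above can be sketched as follows. This is a minimal sketch, not the repo's verbatim quick-start: the checkpoint id and the `predict_action`/`decode_actions` calls follow the pattern in the SpatialVLA README, but you should verify the exact checkpoint name and API against the Model Zoo before relying on them.

```python
# Minimal SpatialVLA inference sketch (assumed checkpoint id and prompt
# template -- check the project's Model Zoo and README for the real ones).

def build_prompt(instruction: str) -> str:
    """Wrap a natural-language instruction in the VLA prompt template."""
    return f"What action should the robot take to {instruction}?"

if __name__ == "__main__":
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model_id = "IPEC-COMMUNITY/spatialvla-4b-224-pt"  # assumed checkpoint name
    # trust_remote_code pulls the custom SpatialVLA model/processor classes.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).eval().cuda()

    image = Image.open("example.png").convert("RGB")  # a robot observation frame
    inputs = processor(
        images=[image], text=build_prompt("pick up the cup"), return_tensors="pt"
    )
    actions = processor.decode_actions(model.predict_action(inputs))
    print(actions)
```

Loading at bfloat16 is what keeps inference within the ~8.5 GB GPU memory budget noted above.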

Highlighted Details

  • Achieves state-of-the-art performance on various benchmarks, including SimplerEnv and LIBERO, with faster inference speeds.
  • Trained on 1.1 million real robot episodes from OXE and RH20T datasets.
  • Supports zero-shot and fine-tuning (including LoRA) capabilities.
  • Built entirely within the HuggingFace ecosystem for easy integration and deployment.
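On the LoRA fine-tuning point: LoRA freezes the pretrained weights and trains only small low-rank adapter matrices, which is what makes fine-tuning a 4B-parameter VLA tractable on modest hardware. A minimal pure-PyTorch sketch of the mechanism (illustrative only, not the repo's actual training code, which likely uses a library such as `peft`):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A x)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))
```

Because `lora_B` is zero-initialized, the wrapped layer initially reproduces the base model exactly; only the rank-`r` factors are updated during fine-tuning.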

Maintenance & Community

The project is actively developed, with recent updates simplifying code structure and fixing dependencies. An advanced version leveraging lerobot is under development. Community interaction is encouraged via GitHub issues and discussions.

Licensing & Compatibility

Released under the MIT license, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that two tasks (Open Top Drawer and Place Apple) were omitted from evaluation because nearly all policies scored close to zero on them. An advanced version (SpatialVLA2) is still under development.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 157 stars in the last 90 days
