Vision-language-action model for robot control, trained on real robot episodes
SpatialVLA is a spatial-enhanced vision-language-action model designed for real-world robotics tasks. It offers a concise, HuggingFace-based implementation and is trained on over 1.1 million real robot episodes, targeting researchers and developers in robotics and embodied AI who need efficient, performant models for robot control.
How It Works
SpatialVLA leverages the PaLiGemma2 model as its backbone, integrating spatial representations to improve vision-language-action understanding. This allows for more nuanced interpretation of spatial relationships in robot environments, leading to better action prediction and control. The model is designed for efficient performance and ease of deployment within the HuggingFace ecosystem.
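As a concrete illustration of that workflow, the sketch below loads a checkpoint through the HuggingFace transformers API and queries it for an action. The Hub model ID, the predict_action and decode_actions helpers, and the unnorm_key value are assumptions about how the released checkpoint is packaged, not confirmed API details.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed Hub ID; substitute the checkpoint name published by the project.
MODEL_ID = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# trust_remote_code loads the custom SpatialVLA modeling/processing code
# that ships with the checkpoint.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

# One camera observation plus a natural-language instruction.
image = Image.open("observation.png").convert("RGB")
prompt = "What action should the robot take to pick up the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Assumed helper exposed by the remote code: generates action tokens.
    outputs = model.predict_action(inputs)

# Assumed helper: converts the tokens back to a continuous end-effector
# action, un-normalized with the statistics of the named training split.
action = processor.decode_actions(outputs, unnorm_key="bridge_orig/1.0.0")
print(action)
```

In practice the decoded action would be forwarded to the robot's low-level controller at each control step.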
Quick Start & Requirements
Dependencies are listed in requirements.txt, including a custom dlimp package.
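After installing from requirements.txt, a quick import check can confirm that the key dependencies resolve. Apart from dlimp, which the summary names explicitly, the package list below is an assumption about a typical HuggingFace-based robot-learning stack.

```python
import importlib

# "dlimp" comes from the custom dependency mentioned above; the other
# names are assumed core packages and may differ from the actual pins.
for pkg in ("torch", "transformers", "dlimp"):
    try:
        module = importlib.import_module(pkg)
        version = getattr(module, "__version__", "unknown version")
        print(f"{pkg}: OK ({version})")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")
```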
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates simplifying code structure and fixing dependencies. An advanced version leveraging lerobot is under development. Community interaction is encouraged via GitHub issues and discussions.
Licensing & Compatibility
Released under the MIT license, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The README mentions that some tasks (Open Top Drawer and Place Apple) were omitted from evaluation due to near-zero scores across most policies. An advanced version (SpatialVLA2) is still under development.