Vision-language-action model for robot control, trained on real robot episodes
SpatialVLA is a spatial-enhanced vision-language-action model designed for real-world robotics tasks. It offers a concise, HuggingFace-based implementation and is trained on over 1.1 million real robot episodes, targeting researchers and developers in robotics and embodied AI who need efficient, performant models for robot control.
How It Works
SpatialVLA leverages the PaLiGemma2 model as its backbone, integrating spatial representations to improve vision-language-action understanding. This allows for more nuanced interpretation of spatial relationships in robot environments, leading to better action prediction and control. The model is designed for efficient performance and ease of deployment within the HuggingFace ecosystem.
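As a concrete illustration of that workflow, the sketch below loads a checkpoint through the HuggingFace transformers API and queries it for an action. The Hub model ID, the predict_action and decode_actions helpers, and the unnorm_key value are assumptions about how the released checkpoint is packaged, not confirmed API details.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed Hub ID; substitute the checkpoint name published by the project.
MODEL_ID = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# trust_remote_code loads the custom SpatialVLA modeling/processing code
# that ships with the checkpoint.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

# One camera observation plus a natural-language instruction.
image = Image.open("observation.png").convert("RGB")
prompt = "What action should the robot take to pick up the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Assumed helper exposed by the remote code: generates action tokens.
    outputs = model.predict_action(inputs)

# Assumed helper: converts the tokens back to a continuous end-effector
# action, un-normalized with the statistics of the named training split.
action = processor.decode_actions(outputs, unnorm_key="bridge_orig/1.0.0")
print(action)
```

In practice the decoded action would be forwarded to the robot's low-level controller at each control step.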
Quick Start & Requirements
Dependencies are listed in requirements.txt, including a custom dlimp package.
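After installing from requirements.txt, a quick import check can confirm that the key dependencies resolve. Apart from dlimp, which the summary names explicitly, the package list below is an assumption about a typical HuggingFace-based robot-learning stack.

```python
import importlib

# "dlimp" comes from the custom dependency mentioned above; the other
# names are assumed core packages and may differ from the actual pins.
for pkg in ("torch", "transformers", "dlimp"):
    try:
        module = importlib.import_module(pkg)
        version = getattr(module, "__version__", "unknown version")
        print(f"{pkg}: OK ({version})")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")
```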
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates simplifying code structure and fixing dependencies. An advanced version leveraging lerobot is under development. Community interaction is encouraged via GitHub issues and discussions.
Licensing & Compatibility
Released under the MIT license, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The README mentions that some tasks (Open Top Drawer and Place Apple) were omitted from evaluation due to near-zero scores across most policies. An advanced version (SpatialVLA2) is still under development.