Robotics vision-language-action models
Top 12.1% on sourcepulse
This repository provides open-source Vision-Language-Action (VLA) models, specifically the $\pi_0$ (diffusion-based) and $\pi_0$-FAST (autoregressive) models, for robotics applications. It offers pre-trained checkpoints and fine-tuning examples, targeting robotics researchers and practitioners looking to adapt VLA models to their own robot platforms and tasks.
How It Works
The project offers two VLA models: $\pi_0$, a flow-based diffusion model, and $\pi_0$-FAST, an autoregressive model utilizing the FAST action tokenizer. Both are trained on extensive robot data (10k+ hours) and are designed for tasks involving visual perception, language understanding, and robotic action generation. The models can be used for inference directly or fine-tuned on custom datasets, enabling adaptation to specific robot hardware and manipulation skills.
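To make the inference path concrete, the sketch below loads a pre-trained checkpoint and queries it for a chunk of actions. The module paths (openpi.training.config, openpi.policies.policy_config, openpi.shared.download), the checkpoint name pi0_fast_droid, and the observation keys are taken from the upstream openpi examples rather than from the text above, so treat them as assumptions and verify them against the version you install.

```python
"""Minimal inference sketch, assuming the upstream openpi package layout.
Module paths, checkpoint name, and observation keys are assumptions to verify."""
import numpy as np

from openpi.policies import policy_config
from openpi.shared import download
from openpi.training import config as openpi_config

# Select a pre-trained pi0-FAST checkpoint and fetch it locally.
cfg = openpi_config.get_config("pi0_fast_droid")
checkpoint_dir = download.maybe_download("s3://openpi-assets/checkpoints/pi0_fast_droid")

# Build the policy once; it can then be queried step by step.
policy = policy_config.create_trained_policy(cfg, checkpoint_dir)

# Dummy observation: on a real robot these come from cameras and proprioception,
# and the exact keys depend on the data the checkpoint was trained on.
observation = {
    "observation/exterior_image_1_left": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation/wrist_image_left": np.zeros((224, 224, 3), dtype=np.uint8),
    "observation/joint_position": np.zeros(7, dtype=np.float32),
    "observation/gripper_position": np.zeros(1, dtype=np.float32),
    "prompt": "pick up the fork",
}

# The policy returns a short horizon of future actions to execute on the robot.
action_chunk = policy.infer(observation)["actions"]
print(action_chunk.shape)
```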
Quick Start & Requirements
Clone the repository with its submodules (git clone --recurse-submodules) and use uv for dependency management (uv sync, then uv pip install -e .), as shown below.
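A minimal command sequence under those assumptions might look like the following; the repository URL is not given above, so it is left as a placeholder.

```bash
# Clone the repository together with its submodules
# (replace <repository-url> with the actual repository URL).
git clone --recurse-submodules <repository-url>
cd <cloned-directory>

# Resolve and install dependencies with uv.
uv sync
uv pip install -e .
```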
Docker installation is also supported.

Highlighted Details
Maintenance & Community
The project is published by the Physical Intelligence team. Community channels (Discord/Slack) and a public roadmap are not mentioned in the README.
Licensing & Compatibility
The README does not explicitly state the license type. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is experimental, and models are developed for specific robot platforms, with no guarantee of success when adapting to different hardware. Fine-tuning requires significant GPU memory and data preparation.