Re-implementation of the Pi0 vision-language-action model from the Physical Intelligence paper
This repository provides an open-source re-implementation of the Pi0 vision-language-action (VLA) model from the Physical Intelligence paper. It targets researchers and engineers working on embodied AI and robotics, offering a modular architecture for integrating visual, language, and action modalities to control robotic agents.
How It Works
The model employs a Mixture-of-Experts (MoE) or Mixture-of-Transformers (MoT) architecture, combining a pre-trained 3B PaliGemma VLM with a new 0.315B action expert. Attention is governed by block-wise causal masking: the VLM block attends only to itself, the proprioception block (which shares weights with the action expert) attends to itself and the VLM, and the action block attends to all blocks. Training minimizes a flow-matching loss on the action expert's output.
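The two ideas above can be illustrated with a short sketch. The PyTorch code below is a minimal, self-contained illustration and not the repository's actual code: the token counts, the `action_expert` callable, and the linear interpolation path used for flow matching are assumptions made for the example.

```python
import torch


def blockwise_causal_mask(n_vlm: int, n_prop: int, n_act: int) -> torch.Tensor:
    """Boolean attention mask (True = query may attend to key).

    Sequence layout: [VLM tokens | proprio tokens | action tokens].
    - VLM tokens attend only to VLM tokens.
    - Proprio tokens attend to VLM + proprio tokens.
    - Action tokens attend to all tokens.
    Token counts are illustrative, not the repo's actual values.
    """
    n = n_vlm + n_prop + n_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_vlm, :n_vlm] = True                          # VLM -> VLM
    mask[n_vlm:n_vlm + n_prop, :n_vlm + n_prop] = True   # proprio -> VLM + proprio
    mask[n_vlm + n_prop:, :] = True                      # action -> all blocks
    return mask


def flow_matching_loss(action_expert, x1: torch.Tensor) -> torch.Tensor:
    """Toy flow-matching objective on action chunks x1 of shape (batch, horizon, dim).

    Uses a linear path x_t = t * x1 + (1 - t) * x0 with Gaussian noise x0;
    the expert predicts the velocity (x1 - x0). `action_expert` is a stand-in
    callable, not the repository's module.
    """
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-sample time step
    xt = t * x1 + (1 - t) * x0                           # point on the interpolation path
    v_target = x1 - x0                                   # ground-truth velocity
    v_pred = action_expert(xt, t)                        # expert's velocity prediction
    return torch.mean((v_pred - v_target) ** 2)
```

For example, `blockwise_causal_mask(3, 1, 2)` returns a 6x6 boolean matrix whose last two rows (the action queries) are all True, matching the "action attends to all" rule.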
Quick Start & Requirements
- Install dependencies with `uv sync` (`uv` is recommended; a venv or conda environment also works).
- Clone the SimplerEnv fork into the same directory as this repository.
- Set the environment variables (`VLA_DATA_DIR`, `VLA_LOG_DIR`, `VLA_WANDB_ENTITY`) via `source scripts/set_path.sh`.
- Download the PaliGemma weights (cached through Hugging Face Transformers); a sketch of one way to do this appears below.

Highlighted Details
- Supports `torch.compile`.
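For the weight-download step above, one possible approach (an assumption, not necessarily the repository's own script) is to let Hugging Face Transformers pull the checkpoint into its cache. The checkpoint id `google/paligemma-3b-pt-224` is the public 3B pretrained model and is used here for illustration; check the repository's scripts for the exact checkpoint it expects.

```python
# Hypothetical helper: pre-populate the Hugging Face cache with PaliGemma weights.
# Note: the PaliGemma checkpoints are gated on the Hub, so you may need to accept
# the license and run `huggingface-cli login` first.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # illustrative checkpoint id; verify against the repo

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Rough sanity check that the ~3B-parameter VLM loaded.
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```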
Maintenance & Community
The project is maintained by the author, with acknowledgments to the open-source PaliGemma, Octo, and dlimp projects. Discussions can be initiated via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is an implementation based on the author's understanding of the Pi0 paper, and users are encouraged to report potential misunderstandings or bugs. `torch_tensorrt` compilation currently fails silently. Training performance may be affected when using QLoRA.