Re-implementation of the Pi0 vision-language-action model from the Physical Intelligence paper
This repository provides an open-source re-implementation of the Pi0 vision-language-action (VLA) model from the Physical Intelligence paper. It targets researchers and engineers working on embodied AI and robotics, offering a modular architecture for integrating visual, language, and action modalities to control robotic agents.
How It Works
The model employs a Mixture-of-Experts (MoE) or Mixture-of-Transformers (MoT) architecture, combining a pre-trained 3B PaliGemma VLM with a new 0.315B action expert. Attention is governed by block-wise causal masking: the VLM block attends only to itself, the proprioception block (which shares weights with the action expert) attends to itself and the VLM, and the action block attends to all blocks. Training minimizes a flow-matching loss on the action expert's output.
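The two ideas above can be illustrated with a short sketch. The PyTorch code below is a minimal, self-contained illustration and not the repository's actual code: the token counts, the `action_expert` callable, and the linear interpolation path used for flow matching are assumptions made for the example.

```python
import torch


def blockwise_causal_mask(n_vlm: int, n_prop: int, n_act: int) -> torch.Tensor:
    """Boolean attention mask (True = query may attend to key).

    Sequence layout: [VLM tokens | proprio tokens | action tokens].
    - VLM tokens attend only to VLM tokens.
    - Proprio tokens attend to VLM + proprio tokens.
    - Action tokens attend to all tokens.
    Token counts are illustrative, not the repo's actual values.
    """
    n = n_vlm + n_prop + n_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_vlm, :n_vlm] = True                          # VLM -> VLM
    mask[n_vlm:n_vlm + n_prop, :n_vlm + n_prop] = True   # proprio -> VLM + proprio
    mask[n_vlm + n_prop:, :] = True                      # action -> all blocks
    return mask


def flow_matching_loss(action_expert, x1: torch.Tensor) -> torch.Tensor:
    """Toy flow-matching objective on action chunks x1 of shape (batch, horizon, dim).

    Uses a linear path x_t = t * x1 + (1 - t) * x0 with Gaussian noise x0;
    the expert predicts the velocity (x1 - x0). `action_expert` is a stand-in
    callable, not the repository's module.
    """
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-sample time step
    xt = t * x1 + (1 - t) * x0                           # point on the interpolation path
    v_target = x1 - x0                                   # ground-truth velocity
    v_pred = action_expert(xt, t)                        # expert's velocity prediction
    return torch.mean((v_pred - v_target) ** 2)
```

For example, `blockwise_causal_mask(3, 1, 2)` returns a 6x6 boolean matrix whose last two rows (the action queries) are all True, matching the "action attends to all" rule.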
Quick Start & Requirements
- Install dependencies with `uv sync` (`uv` is recommended; a venv or conda environment also works).
- Clone the SimplerEnv fork into the same directory as this repository.
- Set the environment variables (`VLA_DATA_DIR`, `VLA_LOG_DIR`, `VLA_WANDB_ENTITY`) via `source scripts/set_path.sh`.
- Download the PaliGemma weights (cached through Hugging Face Transformers); a sketch of one way to do this appears below.

Highlighted Details
- Supports `torch.compile`.
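For the weight-download step above, one possible approach (an assumption, not necessarily the repository's own script) is to let Hugging Face Transformers pull the checkpoint into its cache. The checkpoint id `google/paligemma-3b-pt-224` is the public 3B pretrained model and is used here for illustration; check the repository's scripts for the exact checkpoint it expects.

```python
# Hypothetical helper: pre-populate the Hugging Face cache with PaliGemma weights.
# Note: the PaliGemma checkpoints are gated on the Hub, so you may need to accept
# the license and run `huggingface-cli login` first.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # illustrative checkpoint id; verify against the repo

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Rough sanity check that the ~3B-parameter VLM loaded.
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```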
Maintenance & Community
The project is maintained by the author, with acknowledgments to the open-source PaliGemma, Octo, and dlimp projects. Discussions can be initiated via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is an implementation based on the author's understanding of the Pi0 paper, and users are encouraged to report potential misunderstandings or bugs. `torch_tensorrt` compilation currently fails silently. Training performance may be affected when using QLoRA.