open-pi-zero  by allenzren

Re-implementation of the Pi0 vision-language-action model from the Physical Intelligence paper

created 8 months ago
1,051 stars

Top 36.4% on sourcepulse

Project Summary

This repository provides an open-source re-implementation of the Pi0 vision-language-action (VLA) model from the Physical Intelligence paper. It targets researchers and engineers working on embodied AI and robotics, offering a modular architecture for integrating visual, language, and action modalities to control robotic agents.

How It Works

The model follows a Mixture-of-Experts-like (MoE) design, also describable as a Mixture-of-Transformers (MoT): it pairs a pre-trained 3B-parameter PaliGemma VLM with a new 0.315B-parameter action expert. Block-wise causal masking controls which token blocks can see each other: VLM tokens attend only to the VLM block; proprioception tokens, which share weights with the action expert, attend to themselves and the VLM block; action tokens attend to all three blocks. The model is trained with a flow matching loss on the action expert's output.
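The masking rule and the training objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's code: the function names are hypothetical, attention inside each block is assumed to be unrestricted, and the loss assumes a straight-line path from Gaussian noise to the action with a uniformly sampled time, which may differ from the convention the project actually uses.

```python
import random


def blockwise_causal_mask(n_vlm, n_proprio, n_action):
    """Return mask[q][k], True when query token q may attend to key token k.

    Blocks are ordered VLM (0), proprioception (1), action (2). A token sees
    its own block and every earlier block. Full attention inside a block is
    an assumption of this sketch.
    """
    block_id = [0] * n_vlm + [1] * n_proprio + [2] * n_action
    return [[q >= k for k in block_id] for q in block_id]


def flow_matching_loss(predict_velocity, actions, rng):
    """Mean squared error between predicted and target velocity.

    actions: list of flat action vectors (lists of floats), one per sample.
    predict_velocity(x_t, t): stands in for the action expert (hypothetical).
    Assumed path: x_t = (1 - t) * noise + t * action, so the target velocity
    is the constant (action - noise).
    """
    total, count = 0.0, 0
    for action in actions:
        t = rng.random()  # one flow time in [0, 1) per sample
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
        target = [a - n for n, a in zip(noise, action)]
        pred = predict_velocity(x_t, t)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
        count += len(action)
    return total / count
```

With `blockwise_causal_mask(2, 1, 3)`, the two VLM rows are True only in the first two columns, the proprioception row is True in the first three, and the action rows are True everywhere. In a real model the mask would be applied to the attention logits and `predict_velocity` would be the action expert conditioned on the VLM and proprioception tokens.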

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using uv sync (or venv/conda). Requires cloning the SimplerEnv fork to the same directory.
  • Prerequisites: Python, uv (recommended), Hugging Face Transformers cache for PaliGemma weights.
  • Setup: Set environment variables (VLA_DATA_DIR, VLA_LOG_DIR, VLA_WANDB_ENTITY) via source scripts/set_path.sh. Download PaliGemma weights.
  • Resources: Training requires substantial GPU resources (e.g., L40 or H100 GPUs). Inference on an RTX 4090 takes about 75 ms with bf16 and torch.compile.
  • Links: SimplerEnv fork, PaliGemma weights.
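Before launching a run, it can help to confirm that the environment variables from the Setup step are actually set. The following is a minimal sketch: the variable names come from the list above, but the helper `missing_vla_env` is hypothetical and not part of the repository.

```python
import os

# Variable names are taken from the Setup step; the helper itself is hypothetical.
REQUIRED_VARS = ("VLA_DATA_DIR", "VLA_LOG_DIR", "VLA_WANDB_ENTITY")


def missing_vla_env(env=None):
    """Return the required variables that are unset or empty, in declaration order."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vla_env()
    if missing:
        print("Unset: " + ", ".join(missing) + " (try: source scripts/set_path.sh)")
    else:
        print("All VLA_* variables are set.")
```

Passing a dictionary instead of relying on `os.environ` keeps the check easy to exercise without touching the real shell environment.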

Highlighted Details

  • Implements a MoE-like architecture with block-wise causal masking.
  • Supports fine-tuning with pre-trained PaliGemma-3B.
  • Provides pre-trained checkpoints for "Bridge" and "Fractal" datasets.
  • Includes evaluation results on various robotic tasks in SimplerEnv.

Maintenance & Community

The project is maintained by its author and acknowledges the open-source PaliGemma, Octo, and dlimp projects. Questions and discussion go through GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is an implementation based on the author's understanding of the Pi0 paper, and users are encouraged to report potential misunderstandings or bugs. torch_tensorrt compilation currently fails silently. Training performance may be affected when using QLoRA.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 199 stars in the last 90 days
