Psi0  by physical-superintelligence-lab

Humanoid loco-manipulation foundation model

Created 1 month ago
1,434 stars

Top 28.0% on SourcePulse

View on GitHub
Project Summary

Psi-Zero: A Foundation Model for Humanoid Loco-Manipulation

Psi-Zero is an open foundation model designed for dexterous humanoid loco-manipulation, aiming to advance universal humanoid intelligence. It addresses the challenge of rapidly acquiring new, complex manipulation skills by enabling fine-tuning with minimal real-world data. This project targets researchers and engineers in robotics and AI, offering a powerful base model that significantly reduces the data and time required to teach robots new tasks.

How It Works

The Psi-Zero architecture comprises two primary end-to-end trained components: a vision-language backbone (System-2) and a multimodal diffusion transformer action expert (System-1). The backbone, based on Qwen3-VL-2B-Instruct, extracts features from observations and instructions. These features condition a flow-based multimodal diffusion transformer, inspired by Stable Diffusion 3, which predicts future whole-body action chunks. At the lowest level (System-0), an RL-based tracking controller ensures precise physical execution of the predicted actions. This approach allows the model to learn task semantics and visual representations from large-scale egocentric videos, then adapt to real-world embodiment dynamics through post-training on limited teleoperated robot data.
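The three-level flow described above can be sketched in miniature. This is an illustrative toy, not the real Psi-Zero implementation: the dimensions, the random linear "velocity field" standing in for the diffusion transformer, and the proportional controller standing in for the RL tracker are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustration only, not the real Psi-Zero config.
FEAT_DIM = 64   # conditioning features from the System-2 VLM backbone
ACT_DIM = 32    # whole-body action dimension
CHUNK = 16      # actions per predicted chunk
STEPS = 8       # flow integration steps

# Stand-in weights for the learned velocity field (random here).
W_cond = rng.normal(scale=0.1, size=(FEAT_DIM, ACT_DIM))
W_act = rng.normal(scale=0.1, size=(ACT_DIM, ACT_DIM))

def system2_encode(observation, instruction):
    """System-2: the VLM backbone maps (image, text) to conditioning
    features. Placeholder: a random vector instead of real VLM features."""
    return rng.normal(size=FEAT_DIM)

def system1_predict_chunk(cond):
    """System-1: the flow-based diffusion transformer, sketched as Euler
    integration of a velocity field from noise toward an action chunk."""
    x = rng.normal(size=(CHUNK, ACT_DIM))   # start from Gaussian noise
    dt = 1.0 / STEPS
    for _ in range(STEPS):
        v = x @ W_act + cond @ W_cond       # "learned" velocity (stand-in)
        x = x + dt * v                      # Euler step along the flow
    return x

def system0_track(chunk, state):
    """System-0: the tracking controller; here a proportional step toward
    each target action in the chunk."""
    for target in chunk:
        state = state + 0.5 * (target - state)
    return state

cond = system2_encode("egocentric_frame_0", "pick up the cup")
chunk = system1_predict_chunk(cond)
state = system0_track(chunk, np.zeros(ACT_DIM))
print(chunk.shape, state.shape)  # (16, 32) (32,)
```

The point of the sketch is the data flow: conditioning features from System-2 steer the System-1 denoising loop, and System-0 consumes the resulting chunk action by action.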

Quick Start & Requirements

Installation involves cloning the repository and managing Python dependencies with uv. Key commands include setting up a virtual environment (uv venv .venv-psi, source .venv-psi/bin/activate) and synchronizing packages (uv sync --all-groups). A specific requirement is flash_attn==2.7.4.post1, and Python 3.10 is used for environment management. Pre-trained models and datasets are available on Hugging Face.
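Collected as a single script, the steps above look roughly as follows. The repository URL is inferred from the project listing and the `--python 3.10` flag is an assumption based on the stated Python version; check the upstream README for the authoritative commands.

```shell
# Clone the repository (org/repo path assumed from the project listing).
git clone https://github.com/physical-superintelligence-lab/Psi0.git
cd Psi0

# Create and activate a Python 3.10 virtual environment with uv.
uv venv .venv-psi --python 3.10
source .venv-psi/bin/activate

# Synchronize all dependency groups.
uv sync --all-groups

# Pinned flash-attention build noted in the requirements.
uv pip install flash_attn==2.7.4.post1
```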

Highlighted Details

  • Capable of acquiring new long-horizon dexterous loco-manipulation skills through fine-tuning with as few as 80 trajectories.
  • Leverages a Qwen3-VL-2B-Instruct backbone for vision-language understanding.
  • Employs a multimodal diffusion transformer for predicting action chunks, enabling efficient fusion of visual, linguistic, and action representations.
  • The model is designed to learn from large-scale human egocentric videos and adapt to real-world robot data.
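Action-chunk prediction is commonly paired with receding-horizon execution: predict a chunk of future actions, execute only a prefix, then re-plan. Whether Psi-Zero does exactly this is not stated in the summary, so the sketch below should be read as a generic illustration of the pattern, with toy dynamics and a placeholder predictor.

```python
import numpy as np

# Receding-horizon execution of predicted action chunks (assumed pattern,
# not confirmed from the repo). Predict CHUNK actions, execute EXECUTE of
# them, then re-plan from the new observation.
CHUNK, EXECUTE, ACT_DIM = 16, 8, 32
rng = np.random.default_rng(1)

def predict_chunk(obs):
    """Placeholder for the action expert: returns CHUNK future actions."""
    return rng.normal(size=(CHUNK, ACT_DIM))

def rollout(steps):
    obs = np.zeros(ACT_DIM)
    executed = []
    buffer = []
    for _ in range(steps):
        if not buffer:
            # Re-plan when the executed prefix is exhausted.
            buffer = list(predict_chunk(obs))[:EXECUTE]
        action = buffer.pop(0)
        obs = obs + 0.1 * action   # toy dynamics update
        executed.append(action)
    return np.array(executed)

acts = rollout(20)
print(acts.shape)  # (20, 32)
```

Executing only a prefix of each chunk trades open-loop smoothness for faster reaction to new observations; the split is a tuning choice.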

Maintenance & Community

The project lists several contributors. No explicit links to community channels (e.g., Discord, Slack) or a public roadmap were found in the provided README.

Licensing & Compatibility

This project is licensed under the Apache License 2.0. This license is permissive and generally compatible with commercial use and linking in closed-source projects.

Limitations & Caveats

Installation of the SIMPLE humanoid benchmarking simulator is marked as "Coming soon." Similarly, motion-planning based data generation and teleoperation within the simulator are also pending. The troubleshooting section indicates potential issues with specific dependencies such as the lerobot stack, evdev, and wandb, as well as GPU memory considerations on newer hardware.

Health Check

Last Commit: 2 days ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 16
Star History: 1,470 stars in the last 30 days
