VLA pretraining via unsupervised latent action learning from video
Top 82.2% on sourcepulse
LAPA (Latent Action Pretraining) is an unsupervised approach for pretraining Vision-Language-Action (VLA) models, targeting researchers and engineers in robotics and embodied AI. It enables the creation of state-of-the-art VLA models with significantly improved pretraining efficiency, outperforming models trained with ground-truth action labels.
How It Works
LAPA leverages latent action quantization to pretrain VLA models without requiring explicit robot action labels. Instead of labeled actions, it quantizes the change between consecutive video frames into a discrete latent action space, and the VLA is pretrained to predict these latent actions, allowing unsupervised learning from action-free video. This approach achieves over 30x greater pretraining efficiency compared to conventional methods.
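A minimal sketch of the latent action quantization idea, assuming a simple vector-quantization bottleneck over the encoded difference between consecutive frames; the class, parameter names, and dimensions below are hypothetical, and the frame-reconstruction decoder used to train such a model is omitted:

```python
# Toy illustration of latent action quantization (hypothetical names, not LAPA's code).
import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    """Maps the change between two frames to num_tokens discrete codes (e.g. 4 codes of vocab 8)."""

    def __init__(self, frame_dim, num_tokens=4, vocab_size=8, token_dim=32):
        super().__init__()
        # Encoder over the flattened frame difference; a real model would use a video transformer.
        self.encoder = nn.Linear(frame_dim, num_tokens * token_dim)
        self.codebook = nn.Embedding(vocab_size, token_dim)  # shared VQ codebook
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, frame_t, frame_t1):
        # "Action" information is taken from the difference between consecutive frames.
        delta = self.encoder(frame_t1 - frame_t).view(-1, self.num_tokens, self.token_dim)
        # Nearest codebook entry per token -> a discrete latent action (a point in an 8^4 space).
        dists = torch.cdist(delta, self.codebook.weight.unsqueeze(0).expand(delta.size(0), -1, -1))
        codes = dists.argmin(dim=-1)              # (batch, num_tokens), ints in [0, vocab_size)
        quantized = self.codebook(codes)          # (batch, num_tokens, token_dim)
        # Straight-through estimator so gradients reach the encoder during training.
        quantized = delta + (quantized - delta).detach()
        return codes, quantized

# Usage: two flattened 64x64 RGB frame pairs -> 4 discrete latent-action tokens each.
quantizer = LatentActionQuantizer(frame_dim=3 * 64 * 64)
f_t, f_t1 = torch.randn(2, 3 * 64 * 64), torch.randn(2, 3 * 64 * 64)
codes, _ = quantizer(f_t, f_t1)
print(codes.shape)  # torch.Size([2, 4])
```

Discrete codes of this kind can then stand in for action labels as pretraining targets, which is what removes the need for ground-truth robot actions.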
Quick Start & Requirements
- Install dependencies with pip install -r requirements.txt inside a conda environment.
- Download the released checkpoints (tokenizer.model, vqgan, params) from Huggingface; see the sketch after this list.
- Run inference with python -m latent_pretraining.inference after setting up the checkpoints.
- Fine-tune with the provided scripts (scripts/finetune_real.sh, scripts/finetune_simpler.sh); fine-tuning experiments were conducted with 4x 80GB A100 GPUs.
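A sketch of fetching those checkpoint files with the huggingface_hub client; the repo id below is a placeholder rather than the actual repository name, so substitute the one linked from the LAPA README:

```python
# Sketch: download the released checkpoint files (tokenizer.model, vqgan, params).
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="<org>/<lapa-checkpoints>",                       # placeholder, not a real repo id
    allow_patterns=["tokenizer.model", "vqgan*", "params*"],  # the three artifacts listed above
)
print(checkpoint_dir)  # downloaded files land here; configure the inference script accordingly
```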
Highlighted Details
Maintenance & Community
Last updated: 6 months ago; activity status: inactive.
Licensing & Compatibility
Limitations & Caveats
The output of the inference script lies in the discrete latent action space ($8^4$ possible latent actions), not the real action space, so fine-tuning is required to map model outputs to physical robot actions (see the sketch below). Training latent action quantization on a custom dataset requires modifying the data loading code to match the Something-Something V2 dataset structure.
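As a rough illustration of that latent space, assuming it factors into 4 tokens with a vocabulary of 8 (which is what the $8^4$ figure suggests), a flat latent-action index can be unpacked into per-token codes; this is a hypothetical helper, not part of the repository:

```python
def unpack_latent_action(index: int, num_tokens: int = 4, vocab_size: int = 8) -> list[int]:
    """Factor a flat index in [0, 8**4) into 4 base-8 token ids (most significant first)."""
    assert 0 <= index < vocab_size ** num_tokens
    tokens = []
    for _ in range(num_tokens):
        tokens.append(index % vocab_size)
        index //= vocab_size
    return tokens[::-1]

print(unpack_latent_action(4095))  # [7, 7, 7, 7] -- the last of the 8**4 = 4096 latent actions
```

A fine-tuning stage is still needed to associate these discrete codes with continuous robot commands.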