LAPA by LatentActionPretraining

VLA pretraining via unsupervised latent action learning from video

created 9 months ago
340 stars

Top 82.2% on sourcepulse

View on GitHub
Project Summary

LAPA (Latent Action Pretraining) is an unsupervised approach for pretraining Vision-Language-Action (VLA) models, targeting researchers and engineers in robotics and embodied AI. It enables the creation of state-of-the-art VLA models with significantly improved pretraining efficiency, outperforming models trained with ground-truth action labels.

How It Works

LAPA pretrains VLA models without explicit robot action labels by learning latent actions directly from video: a quantization model encodes the change between consecutive frames into a small set of discrete latent action tokens, and the VLA is then pretrained to predict these tokens instead of real actions. Because the training signal comes entirely from video, no teleoperation or action annotation is needed during pretraining, and the authors report over 30x greater pretraining efficiency compared to conventional action-labeled methods.
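
The sketch below illustrates the core idea under simplifying assumptions, and is not LAPA's actual architecture or code: frames are treated as flat feature vectors, the encoder/decoder are small MLPs, and the 4-tokens-from-a-size-8-codebook factorization is chosen only to mirror the $8^4$ latent action space noted under Limitations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionVQ(nn.Module):
    """Toy VQ model: discrete 'latent actions' from consecutive frame pairs."""

    def __init__(self, frame_dim=512, num_tokens=4, codebook_size=8, code_dim=32):
        super().__init__()
        self.num_tokens, self.code_dim = num_tokens, code_dim
        # Encoder: (frame_t, frame_t+1) -> one continuous latent per action token
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, num_tokens * code_dim),
        )
        # Shared discrete codebook: codebook_size entries of dimension code_dim
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Decoder: (frame_t, quantized latent action) -> predicted frame_t+1
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + num_tokens * code_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim),
        )

    def forward(self, frame_t, frame_tp1):
        z = self.encoder(torch.cat([frame_t, frame_tp1], dim=-1))
        z = z.view(-1, self.num_tokens, self.code_dim)
        # Nearest codebook entry per token position -> discrete latent action
        cb = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = torch.cdist(z, cb).argmin(dim=-1)     # (batch, num_tokens) ints in [0, 8)
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        pred_tp1 = self.decoder(torch.cat([frame_t, z_q.flatten(1)], dim=-1))
        # Reconstructing the next frame forces the codes to capture the "action"
        loss = F.mse_loss(pred_tp1, frame_tp1) + 0.25 * F.mse_loss(z, z_q.detach())
        return codes, loss


frame_t, frame_tp1 = torch.randn(2, 512), torch.randn(2, 512)
codes, loss = LatentActionVQ()(frame_t, frame_tp1)
print(codes.shape)  # torch.Size([2, 4]) -- discrete latent actions for each pair
```

During pretraining, the VLA is trained to predict discrete codes like these rather than real robot actions; a later fine-tuning stage maps them to physical actions.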

Quick Start & Requirements

  • Install: Clone the repository and install requirements using pip install -r requirements.txt within a conda environment.
  • Prerequisites: Python 3.10, conda.
  • Model Download: Download the three checkpoint files (tokenizer.model, vqgan, params) from Hugging Face.
  • Inference: Run python -m latent_pretraining.inference after setting up checkpoints.
  • Fine-tuning: Requires preprocessing datasets into a specific JSON format (a rough preprocessing sketch follows this list) and uses the provided shell scripts: scripts/finetune_real.sh for real-world trajectories and scripts/finetune_simpler.sh for SIMPLER rollout trajectories.
  • Resources: Fine-tuning experiments were conducted on 4x 80GB A100 GPUs.
  • Links: Project, Paper, Models
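
The exact JSON schema for fine-tuning data is defined in the repository; the sketch below only illustrates the general preprocessing step, and the field names ("image", "instruction", "action") and trajectory layout are placeholder assumptions, not the confirmed format.

```python
import json

def trajectories_to_json(trajectories, out_path):
    """Serialize (image_path, instruction, action) steps into one JSON file."""
    records = []
    for image_path, instruction, action in trajectories:
        records.append({
            "image": image_path,         # path to the observation frame (placeholder key)
            "instruction": instruction,  # natural-language task description (placeholder key)
            "action": list(action),      # per-step robot action vector (placeholder key)
        })
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# Example with dummy data:
trajectories_to_json(
    [("frames/ep0_000.png", "pick up the cup", [0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0])],
    "finetune_data.json",
)
```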

Highlighted Details

  • Achieves new state-of-the-art performance among VLA models.
  • Demonstrates over 30x pretraining efficiency gain.
  • Won the CoRL 2024 LangRob Workshop Best Paper Award.
  • Supports fine-tuning on real-world trajectories and SIMPLER simulation data.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

The output of the inference script is in a discrete latent action space ($8^4$ possible latent actions), not the real action space, so fine-tuning is required to map predictions to physical robot actions. Training the latent action quantization model on a custom dataset requires modifying the data loading code to match the Something-Something V2 dataset structure.
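
As a rough illustration of the gap between the latent and real action spaces, the hypothetical LatentToRobotAction head below (not part of the repository) shows the kind of mapping that fine-tuning has to learn; the 4-tokens-of-8-codes factorization of the $8^4$ space and the 7-DoF action dimension are assumptions.

```python
import torch
import torch.nn as nn

NUM_TOKENS, CODEBOOK_SIZE, ACTION_DIM = 4, 8, 7  # 7-DoF action dim is an assumption

class LatentToRobotAction(nn.Module):
    """Hypothetical fine-tuned head: discrete latent action tokens -> real actions."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, 16)        # one embedding per code
        self.head = nn.Linear(NUM_TOKENS * 16, ACTION_DIM)  # regress continuous actions

    def forward(self, latent_action_tokens):                # (batch, 4) ints in [0, 8)
        return self.head(self.embed(latent_action_tokens).flatten(1))

latent = torch.randint(0, CODEBOOK_SIZE, (1, NUM_TOKENS))   # e.g. an inference output
print(LatentToRobotAction()(latent).shape)                  # torch.Size([1, 7])
print(CODEBOOK_SIZE ** NUM_TOKENS)                          # 4096 possible latent actions
```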

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

100 stars in the last 90 days
