VLA pretraining via unsupervised latent action learning from video
Top 82.2% on sourcepulse
LAPA (Latent Action Pretraining) is an unsupervised approach for pretraining Vision-Language-Action (VLA) models, targeting researchers and engineers in robotics and embodied AI. It enables the creation of state-of-the-art VLA models with significantly improved pretraining efficiency, outperforming models trained with ground-truth action labels.
How It Works
LAPA leverages latent action quantization to pretrain VLA models without requiring explicit robot action labels. Instead of labeled actions, it quantizes the change between consecutive video frames into a discrete latent action space, and the VLA is pretrained to predict these latent actions, allowing unsupervised learning from action-free video. This approach achieves over 30x greater pretraining efficiency compared to conventional methods.
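A minimal sketch of the latent action quantization idea, assuming a simple vector-quantization bottleneck over the encoded difference between consecutive frames; the class, parameter names, and dimensions below are hypothetical, and the frame-reconstruction decoder used to train such a model is omitted:

```python
# Toy illustration of latent action quantization (hypothetical names, not LAPA's code).
import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    """Maps the change between two frames to num_tokens discrete codes (e.g. 4 codes of vocab 8)."""

    def __init__(self, frame_dim, num_tokens=4, vocab_size=8, token_dim=32):
        super().__init__()
        # Encoder over the flattened frame difference; a real model would use a video transformer.
        self.encoder = nn.Linear(frame_dim, num_tokens * token_dim)
        self.codebook = nn.Embedding(vocab_size, token_dim)  # shared VQ codebook
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, frame_t, frame_t1):
        # "Action" information is taken from the difference between consecutive frames.
        delta = self.encoder(frame_t1 - frame_t).view(-1, self.num_tokens, self.token_dim)
        # Nearest codebook entry per token -> a discrete latent action (a point in an 8^4 space).
        dists = torch.cdist(delta, self.codebook.weight.unsqueeze(0).expand(delta.size(0), -1, -1))
        codes = dists.argmin(dim=-1)              # (batch, num_tokens), ints in [0, vocab_size)
        quantized = self.codebook(codes)          # (batch, num_tokens, token_dim)
        # Straight-through estimator so gradients reach the encoder during training.
        quantized = delta + (quantized - delta).detach()
        return codes, quantized

# Usage: two flattened 64x64 RGB frame pairs -> 4 discrete latent-action tokens each.
quantizer = LatentActionQuantizer(frame_dim=3 * 64 * 64)
f_t, f_t1 = torch.randn(2, 3 * 64 * 64), torch.randn(2, 3 * 64 * 64)
codes, _ = quantizer(f_t, f_t1)
print(codes.shape)  # torch.Size([2, 4])
```

Discrete codes of this kind can then stand in for action labels as pretraining targets, which is what removes the need for ground-truth robot actions.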
Quick Start & Requirements
- Install dependencies with pip install -r requirements.txt inside a conda environment.
- Download the released checkpoints (tokenizer.model, vqgan, params) from Huggingface; see the sketch after this list.
- Run inference with python -m latent_pretraining.inference after setting up the checkpoints.
- Fine-tune with the provided scripts (scripts/finetune_real.sh, scripts/finetune_simpler.sh); fine-tuning experiments were conducted with 4x 80GB A100 GPUs.
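A sketch of fetching those checkpoint files with the huggingface_hub client; the repo id below is a placeholder rather than the actual repository name, so substitute the one linked from the LAPA README:

```python
# Sketch: download the released checkpoint files (tokenizer.model, vqgan, params).
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="<org>/<lapa-checkpoints>",                       # placeholder, not a real repo id
    allow_patterns=["tokenizer.model", "vqgan*", "params*"],  # the three artifacts listed above
)
print(checkpoint_dir)  # downloaded files land here; configure the inference script accordingly
```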
Highlighted Details
Maintenance & Community
Last updated: 6 months ago; activity status: inactive.
Licensing & Compatibility
Limitations & Caveats
The output of the inference script lies in the discrete latent action space ($8^4$ possible latent actions), not the real action space, so fine-tuning is required to map model outputs to physical robot actions (see the sketch below). Training latent action quantization on a custom dataset requires modifying the data loading code to match the Something-Something V2 dataset structure.
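As a rough illustration of that latent space, assuming it factors into 4 tokens with a vocabulary of 8 (which is what the $8^4$ figure suggests), a flat latent-action index can be unpacked into per-token codes; this is a hypothetical helper, not part of the repository:

```python
def unpack_latent_action(index: int, num_tokens: int = 4, vocab_size: int = 8) -> list[int]:
    """Factor a flat index in [0, 8**4) into 4 base-8 token ids (most significant first)."""
    assert 0 <= index < vocab_size ** num_tokens
    tokens = []
    for _ in range(num_tokens):
        tokens.append(index % vocab_size)
        index //= vocab_size
    return tokens[::-1]

print(unpack_latent_action(4095))  # [7, 7, 7, 7] -- the last of the 8**4 = 4096 latent actions
```

A fine-tuning stage is still needed to associate these discrete codes with continuous robot commands.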