Unified Video Action Model for robotics
Top 96.2% on SourcePulse
This repository provides the official PyTorch implementation of the Unified Video Action (UVA) model, designed for robotic manipulation tasks. It enables robots to learn from diverse data sources, including video and action sequences, and can be applied to both simulated and real-world scenarios. The target audience includes robotics researchers and engineers looking to develop more versatile and capable manipulation systems.
How It Works
UVA employs a two-stage training approach: it first trains on video generation alone and then fine-tunes on joint video and action prediction, a schedule reported to yield better performance than training both objectives simultaneously. The model builds on a pretrained VAE and a pretrained image-generation (MAR) model, and its architecture can be extended to incorporate multi-modal data such as sound and force.
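As a rough, hedged illustration of this staged schedule, the toy PyTorch sketch below first optimizes only a video head and then fine-tunes video and action heads jointly. The module names, shapes, losses, and random data are placeholders introduced for this example; the actual UVA model operates on pretrained VAE latents with MAR-style generation rather than the small MLP shown here.

```python
# Toy two-stage schedule, loosely mirroring UVA's video-then-joint training.
# All names, shapes, losses, and data are illustrative placeholders.
import torch
import torch.nn as nn

class ToyVideoActionModel(nn.Module):
    """Shared trunk feeding separate video and action prediction heads."""
    def __init__(self, obs_dim=64, hidden_dim=128, action_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.video_head = nn.Linear(hidden_dim, obs_dim)      # next-frame features
        self.action_head = nn.Linear(hidden_dim, action_dim)  # action prediction

    def forward(self, obs):
        z = self.trunk(obs)
        return self.video_head(z), self.action_head(z)

model = ToyVideoActionModel()
mse = nn.MSELoss()

# Stage 1: video-generation pretraining; only trunk + video head are optimized.
stage1_opt = torch.optim.Adam(
    list(model.trunk.parameters()) + list(model.video_head.parameters()), lr=1e-4
)
for _ in range(100):
    obs, next_obs = torch.randn(32, 64), torch.randn(32, 64)   # stand-in batch
    pred_video, _ = model(obs)
    loss = mse(pred_video, next_obs)
    stage1_opt.zero_grad(); loss.backward(); stage1_opt.step()

# Stage 2: joint fine-tuning on video and action objectives with a smaller lr.
stage2_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
for _ in range(100):
    obs, next_obs = torch.randn(32, 64), torch.randn(32, 64)
    actions = torch.randn(32, 7)                               # stand-in actions
    pred_video, pred_action = model(obs)
    loss = mse(pred_video, next_obs) + mse(pred_action, actions)
    stage2_opt.zero_grad(); loss.backward(); stage2_opt.step()
```

The point of the sketch is the change in optimized parameters and objective between the two stages, which mirrors the video-first pretraining described above.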
Quick Start & Requirements
The environment is created with conda and mamba from conda_environment.yml. Pretrained checkpoints are downloaded with gdown (a minimal download sketch closes this section), and simulation testing is run with eval_sim.py. Real-world experiments require an ARX X5 robot setup and specific modifications to the UMI codebase.
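For the checkpoint-download step, a minimal sketch using gdown's Python API is shown below; the Google Drive file ID and output path are placeholders, not the repository's actual checkpoint locations, and the exact eval_sim.py invocation is documented in the repository.

```python
# Hypothetical checkpoint-download helper; the Drive file ID and output path
# are placeholders and not the repository's actual checkpoint locations.
import os
import gdown

CHECKPOINT_URL = "https://drive.google.com/uc?id=<FILE_ID>"  # placeholder ID
os.makedirs("checkpoints", exist_ok=True)
gdown.download(CHECKPOINT_URL, output="checkpoints/uva.ckpt", quiet=False)
# Simulation testing is then run via eval_sim.py; see the repository for flags.
```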
Maintenance & Community
The project acknowledges contributions from Diffusion Policy and MAR. Further community interaction details (e.g., Discord/Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The repository is provided under the MIT license, allowing for commercial use and integration with closed-source projects.
Limitations & Caveats
The README notes that UVA's performance may be constrained by model size, suggesting larger models for more complex tasks. Future work plans include pretraining on additional video data and exploring alternative architectures like flow matching.