unified_video_action by ShuangLI59

Unified Video Action Model for robotics

Created 6 months ago
266 stars

Top 96.2% on SourcePulse

View on GitHub
Project Summary

This repository provides the official PyTorch implementation of the Unified Video Action (UVA) model, designed for robotic manipulation tasks. It enables robots to learn from diverse data sources, including video and action sequences, and can be applied to both simulated and real-world scenarios. The target audience includes robotics researchers and engineers looking to develop more versatile and capable manipulation systems.

How It Works

UVA employs a two-stage training approach: it first trains on video generation alone, then fine-tunes jointly on video and action prediction. The authors report that this strategy yields better performance than training both objectives simultaneously. The model leverages pretrained VAE and image generation (MAR) models, and its architecture can be extended to incorporate multi-modal data such as sound and force.
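The README does not spell this recipe out in code, but the two-stage idea can be illustrated with a minimal PyTorch sketch. Everything below (module names, the two heads, the reconstruction losses) is a hypothetical simplification, not UVA's actual architecture or training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UVASketch(nn.Module):
    """Hypothetical stand-in: UVA actually pairs a pretrained VAE (for video
    latents) with a MAR-style generator; this is just a shared trunk with a
    video head and an action head."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.video_head = nn.Linear(latent_dim, latent_dim)
        self.action_head = nn.Linear(latent_dim, action_dim)

def train_stage(model, loader, joint, max_steps=1000):
    """Stage 1: video only (joint=False). Stage 2: joint fine-tuning (joint=True)."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step, (latents, actions) in enumerate(loader):
        feats = model.trunk(latents)  # (B, T, latent_dim)
        # Placeholder objective: reconstruct the video latents. The real model
        # generates frames rather than autoencoding its inputs.
        loss = F.mse_loss(model.video_head(feats), latents)
        if joint:
            loss = loss + F.mse_loss(model.action_head(feats), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= max_steps:
            break
```

The only difference between the stages is whether the action loss contributes to the gradient, which mirrors the video-first, then-joint fine-tuning schedule described above.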

Quick Start & Requirements

  • Installation: Use conda and mamba to create the environment from conda_environment.yml.
  • Prerequisites: PyTorch with CUDA support (for GPU training and inference) and gdown for downloading checkpoints.
  • Setup: Download the checkpoints and run eval_sim.py for simulation testing; a minimal sketch follows this list. Real-world experiments require an ARX X5 robot setup and specific modifications to the UMI codebase.
  • Resources: Training is recommended on at least 4 GPUs; for UMI tasks, 8 GPUs and roughly 2 days per stage are suggested. Processing the large-scale UMI dataset requires ~500 GB of RAM.
  • Links: Project page, Paper, Colab (PushT)
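For orientation, here is a minimal Python sketch of the download-and-evaluate step. The Google Drive file ID and output path are placeholders (the real checkpoint links are in the README), and eval_sim.py's actual command-line arguments should be taken from the repository, not from this sketch:

```python
import os
import subprocess
import gdown  # pip install gdown

# Placeholder ID; substitute a real checkpoint link from the README.
CHECKPOINT_ID = "YOUR_GDRIVE_FILE_ID"
os.makedirs("checkpoints", exist_ok=True)
gdown.download(id=CHECKPOINT_ID, output="checkpoints/pusht.ckpt", quiet=False)

# Run the repo's simulation evaluation script; consult the README for the
# flags it expects (e.g., which checkpoint and task to evaluate).
subprocess.run(["python", "eval_sim.py"], check=True)
```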

Highlighted Details

  • Supports training on custom tasks by implementing new datasets, environment runners, and configuration files (see the dataset sketch after this list).
  • Allows integration of custom models by defining configuration files, workspaces, and policy files.
  • Includes detailed instructions for processing and loading large-scale UMI multi-task datasets efficiently.
  • Provides checkpoints for simulation (PushT, PushT-M, Libero10) and real-world (UMI Multitask ARX X5) experiments.
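UVA builds on the Diffusion Policy codebase, where datasets conventionally yield dictionaries of observation and action sequences. The class below is a hypothetical sketch of that pattern under assumed file names and array shapes, not the repository's actual dataset interface:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MyTaskDataset(Dataset):
    """Hypothetical dataset for a custom UVA task. Field names and file
    layout are assumptions, not the repository's actual schema."""

    def __init__(self, root, horizon=16):
        # mmap_mode="r" avoids loading full arrays into RAM, which matters
        # for large multi-task datasets such as UMI.
        self.images = np.load(f"{root}/images.npy", mmap_mode="r")    # (N, H, W, 3) uint8
        self.actions = np.load(f"{root}/actions.npy", mmap_mode="r")  # (N, action_dim)
        self.horizon = horizon

    def __len__(self):
        return len(self.images) - self.horizon

    def __getitem__(self, i):
        sl = slice(i, i + self.horizon)
        return {
            "obs": torch.from_numpy(np.ascontiguousarray(self.images[sl])).float() / 255.0,
            "action": torch.from_numpy(np.ascontiguousarray(self.actions[sl])).float(),
        }
```

Memory-mapping keeps per-process RAM low; the README's UMI data-loading instructions address the same concern at larger scale.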

Maintenance & Community

The project acknowledges contributions from Diffusion Policy and MAR. Community channels (e.g., Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository is provided under the MIT license, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that UVA's performance may be constrained by model size, suggesting larger models for more complex tasks. Future work plans include pretraining on additional video data and exploring alternative architectures like flow matching.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
