unified_video_action by ShuangLI59

Unified Video Action Model for robotics

Created 6 months ago
266 stars

Top 96.2% on SourcePulse

View on GitHub
Project Summary

This repository provides the official PyTorch implementation of the Unified Video Action (UVA) model, designed for robotic manipulation tasks. It enables robots to learn from diverse data sources, including video and action sequences, and can be applied to both simulated and real-world scenarios. The target audience includes robotics researchers and engineers looking to develop more versatile and capable manipulation systems.

How It Works

UVA employs a two-stage training approach: it first trains on video generation alone, then fine-tunes jointly on video and action prediction. The authors report that this strategy yields better performance than training both objectives simultaneously. The model leverages pretrained VAE and image generation (MAR) models, and its architecture can be extended to incorporate multi-modal data such as sound and force.
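The README does not spell this recipe out in code, but the two-stage idea can be illustrated with a minimal PyTorch sketch. Everything below (module names, the two heads, the reconstruction losses) is a hypothetical simplification, not UVA's actual architecture or training loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UVASketch(nn.Module):
    """Hypothetical stand-in: UVA actually pairs a pretrained VAE (for video
    latents) with a MAR-style generator; this is just a shared trunk with a
    video head and an action head."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.video_head = nn.Linear(latent_dim, latent_dim)
        self.action_head = nn.Linear(latent_dim, action_dim)

def train_stage(model, loader, joint, max_steps=1000):
    """Stage 1: video only (joint=False). Stage 2: joint fine-tuning (joint=True)."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step, (latents, actions) in enumerate(loader):
        feats = model.trunk(latents)  # (B, T, latent_dim)
        # Placeholder objective: reconstruct the video latents. The real model
        # generates frames rather than autoencoding its inputs.
        loss = F.mse_loss(model.video_head(feats), latents)
        if joint:
            loss = loss + F.mse_loss(model.action_head(feats), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 >= max_steps:
            break
```

The only difference between the stages is whether the action loss contributes to the gradient, which mirrors the video-first, then-joint fine-tuning schedule described above.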

Quick Start & Requirements

  • Installation: Use conda and mamba to create the environment from conda_environment.yml.
  • Prerequisites: PyTorch with CUDA support (for GPU training and inference) and gdown for downloading checkpoints.
  • Setup: Download the checkpoints and run eval_sim.py for simulation testing; a minimal sketch follows this list. Real-world experiments require an ARX X5 robot setup and specific modifications to the UMI codebase.
  • Resources: Training is recommended on at least 4 GPUs; for UMI tasks, 8 GPUs and roughly 2 days per stage are suggested. Processing the large-scale UMI dataset requires ~500 GB of RAM.
  • Links: Project page, Paper, Colab (PushT)
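For orientation, here is a minimal Python sketch of the download-and-evaluate step. The Google Drive file ID and output path are placeholders (the real checkpoint links are in the README), and eval_sim.py's actual command-line arguments should be taken from the repository, not from this sketch:

```python
import os
import subprocess
import gdown  # pip install gdown

# Placeholder ID; substitute a real checkpoint link from the README.
CHECKPOINT_ID = "YOUR_GDRIVE_FILE_ID"
os.makedirs("checkpoints", exist_ok=True)
gdown.download(id=CHECKPOINT_ID, output="checkpoints/pusht.ckpt", quiet=False)

# Run the repo's simulation evaluation script; consult the README for the
# flags it expects (e.g., which checkpoint and task to evaluate).
subprocess.run(["python", "eval_sim.py"], check=True)
```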

Highlighted Details

  • Supports training on custom tasks by implementing new datasets, environment runners, and configuration files (see the dataset sketch after this list).
  • Allows integration of custom models by defining configuration files, workspaces, and policy files.
  • Includes detailed instructions for processing and loading large-scale UMI multi-task datasets efficiently.
  • Provides checkpoints for simulation (PushT, PushT-M, Libero10) and real-world (UMI Multitask ARX X5) experiments.
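UVA builds on the Diffusion Policy codebase, where datasets conventionally yield dictionaries of observation and action sequences. The class below is a hypothetical sketch of that pattern under assumed file names and array shapes, not the repository's actual dataset interface:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MyTaskDataset(Dataset):
    """Hypothetical dataset for a custom UVA task. Field names and file
    layout are assumptions, not the repository's actual schema."""

    def __init__(self, root, horizon=16):
        # mmap_mode="r" avoids loading full arrays into RAM, which matters
        # for large multi-task datasets such as UMI.
        self.images = np.load(f"{root}/images.npy", mmap_mode="r")    # (N, H, W, 3) uint8
        self.actions = np.load(f"{root}/actions.npy", mmap_mode="r")  # (N, action_dim)
        self.horizon = horizon

    def __len__(self):
        return len(self.images) - self.horizon

    def __getitem__(self, i):
        sl = slice(i, i + self.horizon)
        return {
            "obs": torch.from_numpy(np.ascontiguousarray(self.images[sl])).float() / 255.0,
            "action": torch.from_numpy(np.ascontiguousarray(self.actions[sl])).float(),
        }
```

Memory-mapping keeps per-process RAM low; the README's UMI data-loading instructions address the same concern at larger scale.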

Maintenance & Community

The project acknowledges contributions from Diffusion Policy and MAR. Community channels (e.g., Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository is provided under the MIT license, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The README notes that UVA's performance may be constrained by model size, suggesting larger models for more complex tasks. Future work plans include pretraining on additional video data and exploring alternative architectures like flow matching.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
