Motus by thu-ml

Unified latent action world model for robotics

Created 1 month ago
547 stars

Top 58.3% on SourcePulse

Project Summary

Summary

Motus is a unified latent action world model integrating video generation, understanding, and action capabilities via a Mixture-of-Transformers (MoT) architecture and a flexible UniDiffuser-style scheduler. It leverages optical flow for pixel-level "delta action" extraction, enabling large-scale pretraining and diverse modeling modes (World Models, VLA, Video Gen). This provides a powerful, unified framework for robotics and generative AI research, simplifying complex sequential decision-making tasks.

How It Works

Motus employs a Mixture-of-Transformers (MoT) with three experts: video generation (VGM), vision-language (VLM), and action. A UniDiffuser-style scheduler facilitates switching between modeling paradigms. Its key innovation is using optical flow to derive latent actions, enabling pixel-level "delta action" extraction for efficient, large-scale pretraining on diverse datasets, yielding robust representations for robotic control and video synthesis.
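The expert/scheduler interplay described above can be sketched as follows. This is a toy illustration, not the actual Motus code: the expert functions, the `MODES` table, and `forward` are all assumed names, and the real model uses transformer experts with shared attention rather than these stand-in callables.

```python
import numpy as np

# Toy stand-ins for the three MoT experts (illustrative only).
def vgm_expert(tokens):    # video-generation expert
    return tokens + 1.0

def vlm_expert(tokens):    # vision-language expert
    return tokens + 1.0

def action_expert(tokens):  # action expert
    return tokens + 1.0

EXPERTS = {"video": vgm_expert, "language": vlm_expert, "action": action_expert}

# UniDiffuser-style scheduling: each modality carries its own diffusion
# timestep. t == 0 marks clean conditioning input; t > 0 marks a
# denoising target. Changing the schedule changes the modeling mode.
MODES = {
    "world_model": {"video": 1, "language": 0, "action": 0},  # predict video from actions
    "vla":         {"video": 0, "language": 0, "action": 1},  # predict actions from video
    "video_gen":   {"video": 1, "language": 0, "action": 1},  # jointly generate video + actions
}

def forward(batch, mode):
    """Route each modality's tokens through its expert and flag which
    modalities are denoising targets under the chosen mode."""
    schedule = MODES[mode]
    outputs, targets = {}, []
    for modality, tokens in batch.items():
        outputs[modality] = EXPERTS[modality](tokens)
        if schedule[modality] > 0:
            targets.append(modality)
    return outputs, targets

batch = {m: np.zeros((2, 4)) for m in EXPERTS}
outputs, targets = forward(batch, "vla")
# targets == ["action"]: in VLA mode only the action stream is denoised,
# while video and language tokens serve as conditioning.
```

The design point this captures is that switching paradigms requires no architectural change, only a different per-modality timestep schedule.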

Quick Start & Requirements

Installation requires cloning the repo, setting up a Python 3.10 conda environment, and installing PyTorch (CUDA 12.8), FlashAttention, and project dependencies. High VRAM is essential: more than 24 GB for basic inference, roughly 41 GB for full inference, and more than 80 GB for training (i.e., hardware ranging from an RTX 5090 up to A100/H100/B200-class GPUs). Pretrained checkpoints are available via the Hugging Face CLI. Detailed guides for data, inference, and training are linked.
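The VRAM thresholds above can be summarized in a small helper. `supported_workloads` is an illustrative function written for this summary, not part of the Motus repository; it simply encodes the figures quoted in the paragraph.

```python
def supported_workloads(vram_gb: float) -> list[str]:
    """Map available GPU memory to the Motus workloads it can run,
    using the thresholds stated in the README digest: >24 GB for basic
    inference, ~41 GB for full inference, >80 GB for training.
    (Illustrative helper, not part of the Motus codebase.)"""
    workloads = []
    if vram_gb > 24:
        workloads.append("basic inference")
    if vram_gb >= 41:
        workloads.append("full inference")
    if vram_gb > 80:
        workloads.append("training")
    return workloads

print(supported_workloads(48))  # → ['basic inference', 'full inference']
```

For example, a 48 GB card (e.g., an A6000-class GPU) covers both inference tiers but falls short of the training requirement.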

Highlighted Details

  • Achieved 87.02% average success rate on RoboTwin 2.0 multi-task training, outperforming X-VLA (+15%) and π₀.₅ (+45%).
  • Total model size is ~8 billion parameters (VGM ~5B, VLM ~2.13B).
  • Utilizes a three-stage training pipeline and a six-layer data pyramid.
  • Supports RoboTwin 2.0, LeRobotDataset, AC-One, and Aloha-Agilex-2 data formats.

Maintenance & Community

Presented as an initial release (December 2025), the project welcomes community contributions for maintenance and extensions. Specific community channels, roadmaps, or maintainer details are not provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This omission hinders assessment for commercial use or integration into closed-source projects.

Limitations & Caveats

Substantial VRAM requirements limit accessibility. Because this is an initial release, users should expect potential early-stage bugs or API changes. The undisclosed license is the most critical adoption blocker.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 13
  • Star History: 550 stars in the last 30 days
