Motus by thu-ml

Unified latent action world model for robotics

Created 1 month ago
547 stars

Top 58.3% on SourcePulse

Project Summary

Summary

Motus is a unified latent action world model integrating video generation, understanding, and action capabilities via a Mixture-of-Transformers (MoT) architecture and a flexible UniDiffuser-style scheduler. It leverages optical flow for pixel-level "delta action" extraction, enabling large-scale pretraining and diverse modeling modes (World Models, VLA, Video Gen). This provides a powerful, unified framework for robotics and generative AI research, simplifying complex sequential decision-making tasks.

How It Works

Motus employs a Mixture-of-Transformers (MoT) with three experts: video generation (VGM), vision-language (VLM), and action. A UniDiffuser-style scheduler facilitates switching between modeling paradigms. Its key innovation is using optical flow to derive latent actions, enabling pixel-level "delta action" extraction for efficient, large-scale pretraining on diverse datasets, yielding robust representations for robotic control and video synthesis.
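The expert/scheduler interplay described above can be sketched as follows. This is a toy illustration, not the actual Motus code: the expert functions, the `MODES` table, and `forward` are all assumed names, and the real model uses transformer experts with shared attention rather than these stand-in callables.

```python
import numpy as np

# Toy stand-ins for the three MoT experts (illustrative only).
def vgm_expert(tokens):    # video-generation expert
    return tokens + 1.0

def vlm_expert(tokens):    # vision-language expert
    return tokens + 1.0

def action_expert(tokens):  # action expert
    return tokens + 1.0

EXPERTS = {"video": vgm_expert, "language": vlm_expert, "action": action_expert}

# UniDiffuser-style scheduling: each modality carries its own diffusion
# timestep. t == 0 marks clean conditioning input; t > 0 marks a
# denoising target. Changing the schedule changes the modeling mode.
MODES = {
    "world_model": {"video": 1, "language": 0, "action": 0},  # predict video from actions
    "vla":         {"video": 0, "language": 0, "action": 1},  # predict actions from video
    "video_gen":   {"video": 1, "language": 0, "action": 1},  # jointly generate video + actions
}

def forward(batch, mode):
    """Route each modality's tokens through its expert and flag which
    modalities are denoising targets under the chosen mode."""
    schedule = MODES[mode]
    outputs, targets = {}, []
    for modality, tokens in batch.items():
        outputs[modality] = EXPERTS[modality](tokens)
        if schedule[modality] > 0:
            targets.append(modality)
    return outputs, targets

batch = {m: np.zeros((2, 4)) for m in EXPERTS}
outputs, targets = forward(batch, "vla")
# targets == ["action"]: in VLA mode only the action stream is denoised,
# while video and language tokens serve as conditioning.
```

The design point this captures is that switching paradigms requires no architectural change, only a different per-modality timestep schedule.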

Quick Start & Requirements

Installation requires cloning the repo, setting up a Python 3.10 conda environment, and installing PyTorch (CUDA 12.8), FlashAttention, and project dependencies. High VRAM is essential: more than 24 GB for basic inference, roughly 41 GB for full inference, and more than 80 GB for training (i.e., hardware ranging from an RTX 5090 up to A100/H100/B200-class GPUs). Pretrained checkpoints are available via the Hugging Face CLI. Detailed guides for data, inference, and training are linked.
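The VRAM thresholds above can be summarized in a small helper. `supported_workloads` is an illustrative function written for this summary, not part of the Motus repository; it simply encodes the figures quoted in the paragraph.

```python
def supported_workloads(vram_gb: float) -> list[str]:
    """Map available GPU memory to the Motus workloads it can run,
    using the thresholds stated in the README digest: >24 GB for basic
    inference, ~41 GB for full inference, >80 GB for training.
    (Illustrative helper, not part of the Motus codebase.)"""
    workloads = []
    if vram_gb > 24:
        workloads.append("basic inference")
    if vram_gb >= 41:
        workloads.append("full inference")
    if vram_gb > 80:
        workloads.append("training")
    return workloads

print(supported_workloads(48))  # → ['basic inference', 'full inference']
```

For example, a 48 GB card (e.g., an A6000-class GPU) covers both inference tiers but falls short of the training requirement.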

Highlighted Details

  • Achieved 87.02% average success rate on RoboTwin 2.0 multi-task training, outperforming X-VLA (+15%) and π₀.₅ (+45%).
  • Total model size is ~8 billion parameters (VGM ~5B, VLM ~2.13B).
  • Utilizes a three-stage training pipeline and a six-layer data pyramid.
  • Supports RoboTwin 2.0, LeRobotDataset, AC-One, and Aloha-Agilex-2 data formats.

Maintenance & Community

Presented as an initial release (December 2025), the project welcomes community contributions for maintenance and extensions. Specific community channels, roadmaps, or maintainer details are not provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This omission hinders assessment for commercial use or integration into closed-source projects.

Limitations & Caveats

Substantial VRAM requirements limit accessibility. Because this is an initial release, users should expect potential early-stage bugs or API changes. The undisclosed license is the most critical adoption blocker.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 13
  • Star History: 550 stars in the last 30 days
