thu-ml
Unified latent action world model for robotics
Top 58.3% on SourcePulse
Summary
Motus is a unified latent action world model that combines video generation, visual understanding, and action prediction in a Mixture-of-Transformers (MoT) architecture driven by a flexible UniDiffuser-style scheduler. It uses optical flow to extract pixel-level "delta actions", which enables large-scale pretraining and several modeling modes (World Models, VLA, Video Gen). The result is a single framework for robotics and generative AI research that handles sequential decision-making tasks without a separate model per mode.
How It Works
Motus employs a Mixture-of-Transformers (MoT) with three experts: video generation (VGM), vision-language (VLM), and action. A UniDiffuser-style scheduler facilitates switching between modeling paradigms. Its key innovation is using optical flow to derive latent actions, enabling pixel-level "delta action" extraction for efficient, large-scale pretraining on diverse datasets, yielding robust representations for robotic control and video synthesis.
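The mode switching described above can be illustrated with the general UniDiffuser convention, in which each modality carries its own diffusion timestep: a modality being generated follows the sampling step, a modality given as a condition is held at timestep 0 (clean), and a modality to be ignored is held at the maximum timestep (pure noise). The sketch below is a minimal illustration of that idea only. The names `T_MAX`, `MODE_ROLES`, and `timesteps_for`, the step count, and the exact role assigned to each modality per mode are all assumptions for illustration and are not taken from the Motus code.

```python
# Assumed number of diffusion steps; hypothetical, not from the Motus repository.
T_MAX = 1000

# Hypothetical role of each modality in each mode:
#   "denoise": being generated, so it follows the current sampling step
#   "clean":   supplied as a condition, so its timestep is fixed at 0
#   "noise":   marginalized out, so its timestep is fixed at T_MAX
MODE_ROLES = {
    "world_model": {"video": "denoise", "action": "clean"},
    "vla": {"video": "noise", "action": "denoise"},
    "video_gen": {"video": "denoise", "action": "noise"},
    "joint": {"video": "denoise", "action": "denoise"},
}


def timesteps_for(mode, t):
    """Return the per-modality timesteps for sampling step t in the given mode."""
    if not 0 <= t <= T_MAX:
        raise ValueError(f"t must be in [0, {T_MAX}], got {t}")
    fixed = {"denoise": t, "clean": 0, "noise": T_MAX}
    return {modality: fixed[role] for modality, role in MODE_ROLES[mode].items()}


if __name__ == "__main__":
    for mode in MODE_ROLES:
        print(mode, timesteps_for(mode, 500))
```

Under this convention a single denoising network can serve every mode, because the only thing that changes between modes is the pair of timesteps it is called with.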
Quick Start & Requirements
Installation requires cloning the repo, setting up a Python 3.10 conda environment, and installing PyTorch (CUDA 12.8), FlashAttention, and the project dependencies. High VRAM is essential: more than 24 GB for basic inference, about 41 GB for full inference, and more than 80 GB for training. Suitable GPUs range from an RTX 5090 up to an A100, H100, or B200. Pretrained checkpoints can be downloaded with the Hugging Face CLI. Detailed guides for data preparation, inference, and training are linked.
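As a quick planning aid, the VRAM figures above can be turned into a small lookup. This is only a sketch: the function name `supported_workloads` and the tier labels are hypothetical, and treating the quoted figures as inclusive minimums is an assumption, since the source gives them only approximately.

```python
# VRAM figures in GB from the requirements above, in ascending order.
# Treating them as inclusive minimums is an assumption; the source values are approximate.
VRAM_TIERS = [
    ("basic inference", 24.0),
    ("full inference", 41.0),
    ("training", 80.0),
]


def supported_workloads(vram_gb):
    """Return the workloads whose approximate VRAM requirement fits in vram_gb."""
    return [name for name, needed in VRAM_TIERS if vram_gb >= needed]


if __name__ == "__main__":
    for gpu, vram in [("32 GB card", 32), ("48 GB card", 48), ("80 GB card", 80)]:
        print(gpu, supported_workloads(vram))
```

Actual memory use depends on resolution, sequence length, and batch size, so the result is a rough guide rather than a guarantee.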
Highlighted Details
Maintenance & Community
Presented as an initial release (December 2025), the project welcomes community contributions for maintenance and extensions. Specific community channels, roadmaps, or maintainer details are not provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. This omission hinders assessment for commercial use or integration into closed-source projects.
Limitations & Caveats
Substantial VRAM requirements limit accessibility. As an initial release, users should expect potential early-stage bugs or API changes. The undisclosed license is the most critical adoption blocker.