Mochi 1 is an open-source, state-of-the-art video generation model designed to bridge the gap between closed and open-source solutions. It offers high-fidelity motion and strong prompt adherence, targeting researchers and developers looking to build advanced video generation applications.
How It Works
Mochi 1 utilizes a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture, a 10 billion parameter diffusion model trained from scratch. It employs an AsymmVAE for efficient video compression, shrinking videos to a 128x smaller size with an 8x8 spatial and a 6x temporal compression. The AsymmDiT architecture jointly attends to text and visual tokens using multi-modal self-attention, with separate MLPs for each modality and non-square QKV/output projection layers to reduce inference memory. Prompts are encoded using a single T5-XXL language model.
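To make the joint attention pattern concrete, here is a minimal PyTorch sketch of one asymmetric block: text and visual tokens share a single self-attention pass, while each modality keeps its own (non-square) projections and MLP. The class name, dimensions, and layer layout are illustrative assumptions, not Genmo's implementation; Mochi 1's real widths, normalization, and conditioning are more involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricJointBlock(nn.Module):
    # Illustrative sizes only; Mochi 1's actual dimensions differ.
    def __init__(self, visual_dim=1024, text_dim=512, attn_dim=1024, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = attn_dim // num_heads
        # Non-square QKV projections: each modality maps from its own width
        # into a shared attention width.
        self.qkv_visual = nn.Linear(visual_dim, 3 * attn_dim)
        self.qkv_text = nn.Linear(text_dim, 3 * attn_dim)
        # Non-square output projections back to each modality's width.
        self.out_visual = nn.Linear(attn_dim, visual_dim)
        self.out_text = nn.Linear(attn_dim, text_dim)
        # Separate feed-forward MLPs per modality.
        self.mlp_visual = nn.Sequential(
            nn.Linear(visual_dim, 4 * visual_dim), nn.GELU(),
            nn.Linear(4 * visual_dim, visual_dim))
        self.mlp_text = nn.Sequential(
            nn.Linear(text_dim, 4 * text_dim), nn.GELU(),
            nn.Linear(4 * text_dim, text_dim))

    def _split_heads(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, visual_tokens, text_tokens):
        n_vis = visual_tokens.shape[1]
        # Project each modality separately, then concatenate along the
        # sequence axis so attention is computed jointly over both.
        q_v, k_v, v_v = self.qkv_visual(visual_tokens).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_text(text_tokens).chunk(3, dim=-1)
        q = self._split_heads(torch.cat([q_v, q_t], dim=1))
        k = self._split_heads(torch.cat([k_v, k_t], dim=1))
        v = self._split_heads(torch.cat([v_v, v_t], dim=1))
        joint = F.scaled_dot_product_attention(q, k, v)
        b, _, n, _ = joint.shape
        joint = joint.transpose(1, 2).reshape(b, n, -1)
        # Split the joint sequence back into modalities; each gets its own
        # output projection and MLP, with residual connections.
        vis = visual_tokens + self.out_visual(joint[:, :n_vis])
        txt = text_tokens + self.out_text(joint[:, n_vis:])
        return vis + self.mlp_visual(vis), txt + self.mlp_text(txt)

# Example: a batch of 2 with 128 visual tokens and 16 text tokens.
block = AsymmetricJointBlock()
vis, txt = block(torch.randn(2, 128, 1024), torch.randn(2, 16, 512))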
Quick Start & Requirements
Clone the repository and install with uv:
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -e . --no-build-isolation
Optionally, install with the Flash Attention extra:
uv pip install -e .[flash] --no-build-isolation
Download the weights with ./scripts/download_weights.py or from Hugging Face, then launch the Gradio UI:
python3 ./demos/gradio_ui.py --model_dir weights/ --cpu_offload
Alternatively, use the CLI demo.
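A plausible CLI invocation, assuming the script name and flags mirror the Gradio command above (check the repository's demos/ directory to confirm):
python3 ./demos/cli.py --model_dir weights/ --cpu_offload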
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The research preview currently generates videos at 480p. Minor warping and distortions may occur with extreme motion. The model is optimized for photorealistic styles and does not perform well with animated content. Organizations should implement additional safety protocols before commercial deployment.