Mochi 1 is an open-source, state-of-the-art video generation model designed to bridge the gap between closed and open-source solutions. It offers high-fidelity motion and strong prompt adherence, targeting researchers and developers looking to build advanced video generation applications.
How It Works
Mochi 1 utilizes a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture, a 10 billion parameter diffusion model trained from scratch. It employs an AsymmVAE for efficient video compression, shrinking videos to a 128x smaller size with an 8x8 spatial and a 6x temporal compression. The AsymmDiT architecture jointly attends to text and visual tokens using multi-modal self-attention, with separate MLPs for each modality and non-square QKV/output projection layers to reduce inference memory. Prompts are encoded using a single T5-XXL language model.
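To make the joint attention pattern concrete, here is a minimal PyTorch sketch of one asymmetric block: text and visual tokens share a single self-attention pass, while each modality keeps its own (non-square) projections and MLP. The class name, dimensions, and layer layout are illustrative assumptions, not Genmo's implementation; Mochi 1's real widths, normalization, and conditioning are more involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricJointBlock(nn.Module):
    # Illustrative sizes only; Mochi 1's actual dimensions differ.
    def __init__(self, visual_dim=1024, text_dim=512, attn_dim=1024, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = attn_dim // num_heads
        # Non-square QKV projections: each modality maps from its own width
        # into a shared attention width.
        self.qkv_visual = nn.Linear(visual_dim, 3 * attn_dim)
        self.qkv_text = nn.Linear(text_dim, 3 * attn_dim)
        # Non-square output projections back to each modality's width.
        self.out_visual = nn.Linear(attn_dim, visual_dim)
        self.out_text = nn.Linear(attn_dim, text_dim)
        # Separate feed-forward MLPs per modality.
        self.mlp_visual = nn.Sequential(
            nn.Linear(visual_dim, 4 * visual_dim), nn.GELU(),
            nn.Linear(4 * visual_dim, visual_dim))
        self.mlp_text = nn.Sequential(
            nn.Linear(text_dim, 4 * text_dim), nn.GELU(),
            nn.Linear(4 * text_dim, text_dim))

    def _split_heads(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, visual_tokens, text_tokens):
        n_vis = visual_tokens.shape[1]
        # Project each modality separately, then concatenate along the
        # sequence axis so attention is computed jointly over both.
        q_v, k_v, v_v = self.qkv_visual(visual_tokens).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_text(text_tokens).chunk(3, dim=-1)
        q = self._split_heads(torch.cat([q_v, q_t], dim=1))
        k = self._split_heads(torch.cat([k_v, k_t], dim=1))
        v = self._split_heads(torch.cat([v_v, v_t], dim=1))
        joint = F.scaled_dot_product_attention(q, k, v)
        b, _, n, _ = joint.shape
        joint = joint.transpose(1, 2).reshape(b, n, -1)
        # Split the joint sequence back into modalities; each gets its own
        # output projection and MLP, with residual connections.
        vis = visual_tokens + self.out_visual(joint[:, :n_vis])
        txt = text_tokens + self.out_text(joint[:, n_vis:])
        return vis + self.mlp_visual(vis), txt + self.mlp_text(txt)

# Example: a batch of 2 with 128 visual tokens and 16 text tokens.
block = AsymmetricJointBlock()
vis, txt = block(torch.randn(2, 128, 1024), torch.randn(2, 16, 512))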
Quick Start & Requirements
Clone the repository and install with uv:
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -e . --no-build-isolation
Optionally, install with the Flash Attention extra:
uv pip install -e .[flash] --no-build-isolation
Download the weights with ./scripts/download_weights.py or from Hugging Face, then launch the Gradio UI:
python3 ./demos/gradio_ui.py --model_dir weights/ --cpu_offload
Alternatively, use the CLI demo.
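A plausible CLI invocation, assuming the script name and flags mirror the Gradio command above (check the repository's demos/ directory to confirm):
python3 ./demos/cli.py --model_dir weights/ --cpu_offload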
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The research preview currently generates videos at 480p. Minor warping and distortions may occur with extreme motion. The model is optimized for photorealistic styles and does not perform well with animated content. Organizations should implement additional safety protocols before commercial deployment.