mochi by genmoai

Video generation model

created 10 months ago
3,323 stars

Top 14.9% on sourcepulse

View on GitHub
Project Summary

Mochi 1 is an open-source, state-of-the-art video generation model designed to bridge the gap between closed and open-source solutions. It offers high-fidelity motion and strong prompt adherence, targeting researchers and developers looking to build advanced video generation applications.

How It Works

Mochi 1 is built on a novel Asymmetric Diffusion Transformer (AsymmDiT) architecture, a 10-billion-parameter diffusion model trained from scratch. It employs an AsymmVAE for efficient video compression, shrinking videos to a 128x smaller size via an 8x8 spatial and a 6x temporal compression. The AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention, using a separate MLP for each modality and non-square QKV and output projection layers to reduce inference memory. Prompts are encoded with a single T5-XXL language model.
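
To make those ratios concrete, here is a minimal sketch of the latent-grid arithmetic implied by the stated compression factors. The example dimensions are illustrative assumptions, and the naive integer division ignores any causal or padding behavior of the real VAE:

    # Back-of-the-envelope latent-grid arithmetic for the stated AsymmVAE
    # factors: 8x8 spatial and 6x temporal compression. Example dimensions
    # are illustrative assumptions, not values taken from this page.
    SPATIAL_FACTOR = 8    # compression per spatial axis
    TEMPORAL_FACTOR = 6   # compression along the frame axis

    def latent_grid(frames: int, height: int, width: int) -> tuple[int, int, int]:
        """Return the (frames, height, width) shape of the compressed latent grid."""
        return (frames // TEMPORAL_FACTOR,
                height // SPATIAL_FACTOR,
                width // SPATIAL_FACTOR)

    print(latent_grid(frames=162, height=480, width=848))
    # -> (27, 60, 106): each latent cell summarizes a 6x8x8 block of pixels.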

Quick Start & Requirements

  • Install: Clone the repository and install using uv:
    git clone https://github.com/genmoai/models
    cd models
    pip install uv
    uv venv .venv
    source .venv/bin/activate
    uv pip install -e . --no-build-isolation
    
  • Dependencies: FFmpeg is required for video output. Flash attention can be installed with uv pip install -e .[flash] --no-build-isolation.
  • Weights: Download weights using scripts/download_weights.py or from Hugging Face (see the sketch after this list).
  • Running: Start the Gradio UI with python3 ./demos/gradio_ui.py --model_dir weights/ --cpu_offload or use the CLI demo.
  • Resources: Requires approximately 60GB VRAM for single-GPU operation. Recommended: 1 H100 GPU. ComfyUI integration can optimize for <20GB VRAM.
  • Docs: Mochi 1 Blog, Playground
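
The weights can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; the repo id is an assumption taken from the model's Hugging Face page, so verify it and the file layout against the official model card before relying on it:

    # Sketch: download the Mochi 1 weights into weights/ so the directory
    # matches the --model_dir argument used by the demos. The repo id is an
    # assumption taken from the model's Hugging Face page; verify it first.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="genmo/mochi-1-preview",  # assumed Hugging Face repo id
        local_dir="weights",              # matches --model_dir weights/
    )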

Highlighted Details

  • Features a 10-billion-parameter Asymmetric Diffusion Transformer (AsymmDiT) architecture.
  • Open-sourced AsymmVAE for efficient video compression.
  • Supports LoRA fine-tuning for custom video generation.
  • Offers a composable API for programmatic use (see the sketch below).
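
For programmatic use, the repository exposes a single-GPU pipeline assembled from model factories. The sketch below is adapted from the repository's demo code; treat the factory names, weight paths, and sampling parameters shown as assumptions that may differ across versions (check demos/cli.py for the current API):

    # Sketch of the composable API, adapted from the repo's demo code.
    # Factory names, paths, and sampling parameters are assumptions that
    # may differ by version; consult demos/cli.py for the current API.
    from genmo.mochi_preview.pipelines import (
        DecoderModelFactory,
        DitModelFactory,
        MochiSingleGPUPipeline,
        T5ModelFactory,
        linear_quadratic_schedule,
    )

    pipeline = MochiSingleGPUPipeline(
        text_encoder_factory=T5ModelFactory(),
        dit_factory=DitModelFactory(
            model_path="weights/dit.safetensors", model_dtype="bf16"
        ),
        decoder_factory=DecoderModelFactory(
            model_path="weights/decoder.safetensors"
        ),
        cpu_offload=True,  # trades speed for the ~60GB single-GPU footprint
    )

    video = pipeline(
        height=480,
        width=848,
        num_frames=31,
        num_inference_steps=64,
        sigma_schedule=linear_quadratic_schedule(64, 0.025),
        cfg_schedule=[4.5] * 64,  # classifier-free guidance weight per step
        batch_cfg=False,
        prompt="a close-up of ocean waves rolling in at sunset",
        negative_prompt="",
        seed=12345,
    )

    from genmo.lib.utils import save_video  # assumed helper from the repo
    save_video(video[0], "video.mp4")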

Maintenance & Community

  • Recent updates include LoRA fine-tuning support and consumer-GPU support in ComfyUI.
  • Related projects include ComfyUI-MochiWrapper and ComfyUI-MochiEdit.
  • Fine-tuning scripts are available for Modal GPUs.

Licensing & Compatibility

  • Released under the permissive Apache 2.0 license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The research preview currently generates videos at 480p. Minor warping and distortions may occur with extreme motion. The model is optimized for photorealistic styles and does not perform well with animated content. Organizations should implement additional safety protocols before commercial deployment.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 208 stars in the last 90 days
