Wan2.2 by Wan-Video

Advanced video generation models with MoE architecture

Created 1 month ago
4,639 stars

Top 10.6% on SourcePulse

View on GitHub
Project Summary

Wan2.2 offers advanced large-scale video generation models, targeting researchers and developers seeking high-quality, controllable video synthesis. It introduces a Mixture-of-Experts (MoE) architecture for increased capacity, cinematic aesthetic control through detailed labeling, and enhanced complex motion generation via expanded training data.

How It Works

Wan2.2 employs a Mixture-of-Experts (MoE) architecture within its diffusion models, splitting the denoising process across timesteps between specialized expert models. Because only one expert is active at any given denoising step, this design boosts total model capacity while keeping per-step computational cost roughly unchanged. Additionally, it incorporates meticulously curated aesthetic data for precise control over lighting, composition, and color, enabling cinematic-style generation. The models are trained on significantly larger datasets, improving generalization across motion, semantics, and aesthetics.
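The timestep-based expert switch described above can be sketched as a simple routing rule. This is a toy illustration only: the real Wan2.2 experts are full diffusion transformers, and the threshold and names below are hypothetical, not the model's actual schedule.

```python
# Toy sketch of timestep-based MoE routing (illustrative; the boundary
# value and expert names are assumptions, not Wan2.2's real schedule).
def select_expert(timestep: int, total_steps: int = 1000,
                  boundary: float = 0.5) -> str:
    """Route a denoising timestep to one of two experts.

    Early, high-noise timesteps go to one expert and late, low-noise
    timesteps to the other, so only a single expert runs per step and
    per-step compute stays roughly constant despite the larger total
    parameter count.
    """
    if timestep / total_steps >= boundary:
        return "high_noise_expert"
    return "low_noise_expert"


# Early step (heavy noise) vs. late step (fine detail):
print(select_expert(900))  # high_noise_expert
print(select_expert(100))  # low_noise_expert
```

The key design point is that capacity scales with the number of experts, while inference cost scales only with the one expert active at each step.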

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. Ensure torch >= 2.4.0.
  • Models: Download checkpoints for T2V-A14B, I2V-A14B, or TI2V-5B from Hugging Face or ModelScope.
  • Hardware: Single-GPU inference for T2V/I2V-A14B requires ~80GB VRAM. TI2V-5B can run on a 24GB VRAM GPU (e.g., RTX 4090) with offloading. Multi-GPU inference is supported via FSDP + DeepSpeed Ulysses.
  • Resources: Inference times vary; TI2V-5B generates a 5-second 720P video in under 9 minutes on a single consumer GPU.
  • Demos & Docs: Available on Hugging Face Spaces and integrated into ComfyUI and Diffusers.

Highlighted Details

  • Features a 5B dense model (TI2V-5B) with a high-compression VAE (16x16x4 compression ratio) supporting 720P@24fps on consumer GPUs.
  • MoE architecture splits denoising across timesteps with specialized experts, increasing total parameters while keeping inference costs similar.
  • Trained on significantly more data (+65.6% images, +83.2% videos) than previous versions for improved generalization.
  • Supports prompt extension using Dashscope API or local models for richer video details.

Maintenance & Community

The project is actively maintained with integrations into ComfyUI and Diffusers. Community support is available via Discord and WeChat groups.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Use of generated content is permitted provided it complies with the license terms and applicable laws; harmful, misleading, or privacy-violating content is prohibited.

Limitations & Caveats

High-end GPUs (80GB VRAM) are recommended for optimal performance on larger models (A14B series). While TI2V-5B is more accessible, it still requires significant resources for high-resolution generation. Prompt extension requires API keys or local model setup.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
54
Star History
2,599 stars in the last 30 days
