Wan2.2 by Wan-Video

Advanced video generation models with MoE architecture

Created 1 month ago
4,639 stars

Top 10.6% on SourcePulse

View on GitHub
Project Summary

Wan2.2 offers advanced large-scale video generation models, targeting researchers and developers seeking high-quality, controllable video synthesis. It introduces a Mixture-of-Experts (MoE) architecture for increased capacity, cinematic aesthetic control through detailed labeling, and enhanced complex motion generation via expanded training data.

How It Works

Wan2.2 employs a Mixture-of-Experts (MoE) architecture within its diffusion models, splitting the denoising process across timesteps between specialized expert models. Because only one expert is active at any given denoising step, this design boosts total model capacity while keeping per-step computational cost roughly unchanged. Additionally, it incorporates meticulously curated aesthetic data for precise control over lighting, composition, and color, enabling cinematic-style generation. The models are trained on significantly larger datasets, improving generalization across motion, semantics, and aesthetics.
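The timestep-based expert switch described above can be sketched as a simple routing rule. This is a toy illustration only: the real Wan2.2 experts are full diffusion transformers, and the threshold and names below are hypothetical, not the model's actual schedule.

```python
# Toy sketch of timestep-based MoE routing (illustrative; the boundary
# value and expert names are assumptions, not Wan2.2's real schedule).
def select_expert(timestep: int, total_steps: int = 1000,
                  boundary: float = 0.5) -> str:
    """Route a denoising timestep to one of two experts.

    Early, high-noise timesteps go to one expert and late, low-noise
    timesteps to the other, so only a single expert runs per step and
    per-step compute stays roughly constant despite the larger total
    parameter count.
    """
    if timestep / total_steps >= boundary:
        return "high_noise_expert"
    return "low_noise_expert"


# Early step (heavy noise) vs. late step (fine detail):
print(select_expert(900))  # high_noise_expert
print(select_expert(100))  # low_noise_expert
```

The key design point is that capacity scales with the number of experts, while inference cost scales only with the one expert active at each step.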

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. Ensure torch >= 2.4.0.
  • Models: Download checkpoints for T2V-A14B, I2V-A14B, or TI2V-5B from Hugging Face or ModelScope.
  • Hardware: Single-GPU inference for T2V/I2V-A14B requires ~80GB VRAM. TI2V-5B can run on a 24GB VRAM GPU (e.g., RTX 4090) with offloading. Multi-GPU inference is supported via FSDP + DeepSpeed Ulysses.
  • Resources: Inference times vary; TI2V-5B generates a 5-second 720P video in under 9 minutes on a single consumer GPU.
  • Demos & Docs: Available on Hugging Face Spaces and integrated into ComfyUI and Diffusers.

Highlighted Details

  • Features a 5B dense model (TI2V-5B) with a high-compression VAE (16x16x4 compression ratio) supporting 720P@24fps on consumer GPUs.
  • MoE architecture splits denoising across timesteps with specialized experts, increasing total parameters while keeping inference costs similar.
  • Trained on significantly more data (+65.6% images, +83.2% videos) than previous versions for improved generalization.
  • Supports prompt extension using Dashscope API or local models for richer video details.

Maintenance & Community

The project is actively maintained with integrations into ComfyUI and Diffusers. Community support is available via Discord and WeChat groups.

Licensing & Compatibility

Licensed under the Apache 2.0 License. Use of generated content is permitted provided it complies with the license terms and applicable laws; harmful, misleading, or privacy-violating content is prohibited.

Limitations & Caveats

High-end GPUs (80GB VRAM) are recommended for optimal performance on larger models (A14B series). While TI2V-5B is more accessible, it still requires significant resources for high-resolution generation. Prompt extension requires API keys or local model setup.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
54
Star History
2,599 stars in the last 30 days
