Wan2.1 is an open-source suite of advanced large-scale video generative models designed for researchers and power users. It offers state-of-the-art performance in text-to-video, image-to-video, and video editing, with a notable focus on accessibility for consumer-grade GPUs.
## How It Works
Wan2.1 is built on the diffusion transformer paradigm and incorporates a novel 3D causal VAE (Wan-VAE) for efficient spatio-temporal compression with temporal causality. It uses a T5-based encoder for multilingual text input and shares a single MLP across all transformer blocks to process time embeddings, which yields significant performance improvements. The models are trained on a large, curated dataset assembled with a four-step cleaning process to ensure quality and diversity.
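The shared time-embedding MLP can be pictured with a minimal PyTorch sketch: one MLP instance is reused by every transformer block, and each block adds only a small learned offset. All names, shapes, and the modulation scheme below are illustrative assumptions, not Wan2.1's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a time-embedding MLP shared across transformer blocks:
# the MLP parameters exist once; each block adds only a small learned offset.
# All names and shapes are hypothetical, not Wan2.1's actual code.

class SharedTimeMLP(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps the timestep embedding to (shift, scale, gate) modulation.
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(t_emb)  # (batch, 3 * dim)

class Block(nn.Module):
    def __init__(self, dim: int, shared_mlp: SharedTimeMLP):
        super().__init__()
        self.shared_mlp = shared_mlp  # same instance handed to every block
        self.block_offset = nn.Parameter(torch.zeros(3 * dim))  # per-block term
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        mod = self.shared_mlp(t_emb) + self.block_offset
        shift, scale, gate = mod.chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * attn_out

dim = 64
shared = SharedTimeMLP(dim)
blocks = nn.ModuleList(Block(dim, shared) for _ in range(4))
x, t_emb = torch.randn(2, 16, dim), torch.randn(2, dim)
for blk in blocks:
    x = blk(x, t_emb)  # one MLP's weights serve all four blocks
print(x.shape)  # torch.Size([2, 16, 64])
```

Sharing the projection this way means every block still conditions on the timestep, while the per-block cost shrinks to a bias-sized parameter.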
## Quick Start & Requirements
- Installation: Clone the repository and install dependencies via `pip install -r requirements.txt`. Ensure `torch >= 2.4.0`.
- Models: Download checkpoints from Hugging Face or ModelScope.
- Inference: Run generation scripts (e.g., `python generate.py --task t2v-14B ...`).
- Dependencies: Python, PyTorch, and Hugging Face libraries. Consumer GPUs are supported: the 1.3B model needs only 8.19 GB of VRAM.
- Resources: Multi-GPU inference is supported via FSDP and xDiT USP.
- Demos: Gradio demos are available, alongside Diffusers and ComfyUI integrations; a minimal Diffusers sketch follows this list.
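For library-based inference, the sketch below follows the Diffusers integration. The class names (`WanPipeline`, `AutoencoderKLWan`) and model ID come from the published integration, but treat exact parameters and defaults as version-dependent and check the current Diffusers docs.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Minimal text-to-video sketch via the Diffusers integration.
# Model ID and settings follow the published 1.3B Diffusers checkpoint;
# verify against the current Diffusers docs before relying on them.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="blurry, low quality",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```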
## Highlighted Details
- Supports Text-to-Video (T2V), Image-to-Video (I2V), First-Last-Frame-to-Video (FLF2V), and Text-to-Image (T2I).
- The T2V-1.3B model requires only 8.19 GB of VRAM and generates 480P video in about 4 minutes on an RTX 4090.
- Capable of generating both Chinese and English text within videos.
- Wan-VAE can encode/decode unlimited-length 1080P videos while preserving temporal information.
- Offers prompt extension using the Dashscope API or local Qwen models (sketched after this list).
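The sketch below illustrates the local-Qwen flavor of prompt extension using `transformers`: a terse user prompt is rewritten into a richer video description before generation. The model name, system prompt, and decoding settings are assumptions for illustration, not the repository's shipped extension pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative prompt-extension sketch: expand a terse prompt with a local
# Qwen chat model. Model name and system prompt are placeholders, not
# Wan2.1's actual extension logic.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "Expand the user's prompt into a detailed, "
                                  "vivid video description. Keep the subject "
                                  "and intent; add scene, lighting, and motion."},
    {"role": "user", "content": "A cat walks on the grass."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
extended_prompt = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
print(extended_prompt)  # feed this into generate.py or the Diffusers pipeline
```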
## Maintenance & Community
- Active development with recent updates in April 2025.
- Integrations with Diffusers and ComfyUI.
- Community works and extensions are highlighted.
- Community channels: Discord, WeChat groups.
## Licensing & Compatibility
- Licensed under the Apache 2.0 License.
- Generated content is free to use, with restrictions against illegal, harmful, or misleading content.
## Limitations & Caveats
- The 1.3B T2V model's 720P generation is less stable than its 480P generation.
- The FLF2V and I2V models are trained primarily on Chinese text-video pairs, so Chinese prompts are recommended for best results.
- In the Diffusers integration, prompt extension and multi-GPU inference are noted as upcoming.