Wan2.1 is an open-source suite of advanced large-scale video generative models designed for researchers and power users. It offers state-of-the-art performance in text-to-video, image-to-video, and video editing, with a notable focus on accessibility for consumer-grade GPUs.
## How It Works
Wan2.1 is built on the diffusion transformer paradigm and incorporates a novel 3D causal VAE (Wan-VAE) for efficient spatio-temporal compression with temporal causality. It uses a T5-based encoder for multilingual text input and shares a single MLP across all transformer blocks to process time embeddings, which yields significant performance improvements. The models are trained on a large, curated dataset assembled with a four-step cleaning process to ensure quality and diversity.
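The shared time-embedding MLP can be pictured with a minimal PyTorch sketch: one MLP instance is reused by every transformer block, and each block adds only a small learned offset. All names, shapes, and the modulation scheme below are illustrative assumptions, not Wan2.1's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a time-embedding MLP shared across transformer blocks:
# the MLP parameters exist once; each block adds only a small learned offset.
# All names and shapes are hypothetical, not Wan2.1's actual code.

class SharedTimeMLP(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps the timestep embedding to (shift, scale, gate) modulation.
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(t_emb)  # (batch, 3 * dim)

class Block(nn.Module):
    def __init__(self, dim: int, shared_mlp: SharedTimeMLP):
        super().__init__()
        self.shared_mlp = shared_mlp  # same instance handed to every block
        self.block_offset = nn.Parameter(torch.zeros(3 * dim))  # per-block term
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        mod = self.shared_mlp(t_emb) + self.block_offset
        shift, scale, gate = mod.chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * attn_out

dim = 64
shared = SharedTimeMLP(dim)
blocks = nn.ModuleList(Block(dim, shared) for _ in range(4))
x, t_emb = torch.randn(2, 16, dim), torch.randn(2, dim)
for blk in blocks:
    x = blk(x, t_emb)  # one MLP's weights serve all four blocks
print(x.shape)  # torch.Size([2, 16, 64])
```

Sharing the projection this way means every block still conditions on the timestep, while the per-block cost shrinks to a bias-sized parameter.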
## Quick Start & Requirements
- Installation: Clone the repository and install dependencies via `pip install -r requirements.txt`. Ensure `torch >= 2.4.0`.
- Models: Download checkpoints from Hugging Face or ModelScope.
- Inference: Run generation scripts (e.g., `python generate.py --task t2v-14B ...`).
- Dependencies: Python, PyTorch, and Hugging Face libraries. Consumer GPUs are supported: the 1.3B model needs only 8.19 GB of VRAM.
- Resources: Multi-GPU inference is supported via FSDP and xDiT USP.
- Demos: Gradio demos are available, alongside Diffusers and ComfyUI integrations; a minimal Diffusers sketch follows this list.
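For library-based inference, the sketch below follows the Diffusers integration. The class names (`WanPipeline`, `AutoencoderKLWan`) and model ID come from the published integration, but treat exact parameters and defaults as version-dependent and check the current Diffusers docs.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Minimal text-to-video sketch via the Diffusers integration.
# Model ID and settings follow the published 1.3B Diffusers checkpoint;
# verify against the current Diffusers docs before relying on them.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="blurry, low quality",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```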
## Highlighted Details
- Supports Text-to-Video (T2V), Image-to-Video (I2V), First-Last-Frame-to-Video (FLF2V), and Text-to-Image (T2I).
- The T2V-1.3B model requires only 8.19 GB of VRAM and generates 480P video in about 4 minutes on an RTX 4090.
- Capable of generating both Chinese and English text within videos.
- Wan-VAE can encode/decode unlimited-length 1080P videos while preserving temporal information.
- Offers prompt extension using the Dashscope API or local Qwen models (sketched after this list).
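The sketch below illustrates the local-Qwen flavor of prompt extension using `transformers`: a terse user prompt is rewritten into a richer video description before generation. The model name, system prompt, and decoding settings are assumptions for illustration, not the repository's shipped extension pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative prompt-extension sketch: expand a terse prompt with a local
# Qwen chat model. Model name and system prompt are placeholders, not
# Wan2.1's actual extension logic.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "Expand the user's prompt into a detailed, "
                                  "vivid video description. Keep the subject "
                                  "and intent; add scene, lighting, and motion."},
    {"role": "user", "content": "A cat walks on the grass."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
extended_prompt = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
print(extended_prompt)  # feed this into generate.py or the Diffusers pipeline
```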
## Maintenance & Community
- Active development with recent updates in April 2025.
- Integrations with Diffusers and ComfyUI.
- Community works and extensions are highlighted.
- Community channels: Discord, WeChat groups.
## Licensing & Compatibility
- Licensed under the Apache 2.0 License.
- Generated content is free to use, with restrictions against illegal, harmful, or misleading content.
## Limitations & Caveats
- The 1.3B T2V model's 720P generation is less stable than its 480P generation.
- The FLF2V and I2V models are trained primarily on Chinese text-video pairs, so Chinese prompts are recommended for best results.
- In the Diffusers integration, prompt extension and multi-GPU inference are noted as upcoming.