Wan2.1 by Wan-Video

Video foundation model for text-to-video, image-to-video, and video editing

created 5 months ago
13,403 stars

Top 3.8% on sourcepulse

View on GitHub
Project Summary

Wan2.1 is an open-source suite of advanced large-scale video generative models designed for researchers and power users. It offers state-of-the-art performance in text-to-video, image-to-video, and video editing, with a notable focus on accessibility for consumer-grade GPUs.

How It Works

Wan2.1 is built on the diffusion transformer paradigm, incorporating a novel 3D causal VAE (Wan-VAE) for efficient spatio-temporal compression while preserving temporal causality. It uses a T5 encoder for multilingual text input and a single MLP, shared across all transformer blocks, to process time embeddings, yielding significant performance improvements. The models are trained on a large, curated dataset cleaned in a four-step process to ensure quality and diversity.
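The shared time-embedding MLP is the most code-friendly piece of this design. Below is a minimal, hypothetical PyTorch sketch of the idea: one MLP maps the timestep embedding to adaLN-style scale/shift/gate parameters and is reused by every transformer block. Class names, dimensions, and the exact modulation scheme are assumptions for illustration, not Wan2.1's actual implementation.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: names and sizes are assumptions, not Wan2.1's real code.
    class SharedTimeMLP(nn.Module):
        """One MLP, shared by all blocks, mapping the diffusion timestep
        embedding to six (B, D) modulation tensors (scale/shift/gate x2)."""
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

        def forward(self, t_emb):                    # t_emb: (B, D)
            return self.net(t_emb).chunk(6, dim=-1)

    class DiTBlock(nn.Module):
        def __init__(self, dim, heads, shared_time_mlp):
            super().__init__()
            self.time_mlp = shared_time_mlp          # same instance in every block
            self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x, t_emb):                 # x: (B, N, D) video tokens
            s1, b1, g1, s2, b2, g2 = (p.unsqueeze(1) for p in self.time_mlp(t_emb))
            h = self.norm1(x) * (1 + s1) + b1        # adaLN-style modulation
            x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
            h = self.norm2(x) * (1 + s2) + b2
            return x + g2 * self.ff(h)

    # The modulation MLP is created once and passed to all blocks:
    shared = SharedTimeMLP(dim=1536)
    blocks = nn.ModuleList([DiTBlock(1536, 12, shared) for _ in range(30)])

The design point is that the modulation projection is instantiated once and handed to every block, rather than duplicated per block.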

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt. Ensure torch >= 2.4.0.
  • Models: Download checkpoints from Hugging Face or ModelScope.
  • Inference: Run generation scripts (e.g., python generate.py --task t2v-14B ...).
  • Dependencies: Python, PyTorch, Hugging Face libraries. Consumer GPUs with at least 8.19 GB VRAM are supported for the 1.3B model.
  • Resources: Multi-GPU inference is supported via FSDP and xDiT USP.
  • Demos: Gradio demos are available; see the Diffusers and ComfyUI integrations. A minimal Diffusers example follows this list.
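For the Diffusers route mentioned above, a minimal text-to-video call looks roughly like this; the model id and generation settings follow the Hugging Face Hub listing and may change, so treat it as a sketch rather than canonical usage.

    import torch
    from diffusers import AutoencoderKLWan, WanPipeline
    from diffusers.utils import export_to_video

    # Model id as published on the Hugging Face Hub; verify before use.
    model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

    # Load the Wan VAE in float32 for numerical stability, the rest in bfloat16.
    vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
    pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    frames = pipe(
        prompt="A cat walks on the grass, realistic style.",
        height=480,
        width=832,
        num_frames=81,        # roughly 5 s of 480P video at 16 fps
        guidance_scale=5.0,
    ).frames[0]

    export_to_video(frames, "t2v_output.mp4", fps=16)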

Highlighted Details

  • Supports Text-to-Video (T2V), Image-to-Video (I2V), First-Last-Frame-to-Video (FLF2V), and Text-to-Image (T2I).
  • T2V-1.3B model requires only 8.19 GB VRAM, generating 480P video in ~4 minutes on an RTX 4090.
  • Capable of generating both Chinese and English text within videos.
  • Wan-VAE can encode/decode unlimited-length 1080P videos while preserving temporal information.
  • Offers prompt extension using the Dashscope API or local Qwen models (see the sketch after this list).
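Prompt extension amounts to asking an LLM to elaborate a terse prompt before generation. A self-contained sketch using a local Qwen model via transformers follows; the checkpoint choice and system prompt are illustrative assumptions, and the repository wires this in through its own options rather than this exact code.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical model choice; the repo may use a different Qwen checkpoint.
    model_id = "Qwen/Qwen2.5-7B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    llm = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    def extend_prompt(short_prompt: str) -> str:
        """Expand a terse user prompt into a detailed video description."""
        messages = [
            {"role": "system", "content": "Rewrite the user's video prompt as a detailed, "
                                          "cinematic description: subject, motion, camera, lighting."},
            {"role": "user", "content": short_prompt},
        ]
        inputs = tok.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(llm.device)
        out = llm.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
        # Decode only the newly generated tokens.
        return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

    print(extend_prompt("A cat walks on the grass."))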

Maintenance & Community

  • Active development with recent updates in April 2025.
  • Integrations with Diffusers and ComfyUI.
  • Community works and extensions are highlighted.
  • Community channels: Discord, WeChat groups.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Generated content is free to use, with restrictions against illegal, harmful, or misleading content.

Limitations & Caveats

  • The 1.3B T2V model's 720P generation is less stable than 480P.
  • The FLF2V and I2V models are trained primarily on Chinese text-video pairs, so Chinese prompts are recommended for best results.
  • Diffusers integration for prompt extension and multi-GPU inference is noted as upcoming.
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 19
  • Star History: 2,701 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (author of SGLang), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

Open-Sora-Plan by PKU-YuanGroup

Open-source project aiming to reproduce a Sora-like T2V model

created 1 year ago, updated 2 weeks ago
12k stars

Top 0.1% on sourcepulse