Wan2.1 by Wan-Video

Video foundation model for text-to-video, image-to-video, and video editing

Created 6 months ago · 14,083 stars · Top 3.5% on SourcePulse

Project Summary

Wan2.1 is an open-source suite of advanced large-scale video generative models designed for researchers and power users. It offers state-of-the-art performance in text-to-video, image-to-video, and video editing, with a notable focus on accessibility for consumer-grade GPUs.

How It Works

Wan2.1 is built on the diffusion transformer paradigm and incorporates a novel 3D causal VAE (Wan-VAE) for efficient spatio-temporal compression with temporal causality. A T5 encoder handles multilingual text input, and a single MLP for processing time embeddings is shared across the transformer blocks, which yields significant performance improvements. The models are trained on a large curated dataset filtered through a four-step cleaning process for quality and diversity.
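The temporal causality in Wan-VAE is what allows long videos to be encoded chunk by chunk: each latent frame depends only on current and past input frames. A minimal PyTorch sketch of a causal 3D convolution, the building block behind that property (class and parameter names here are illustrative, not the repo's actual implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv3d(nn.Module):
        """3D conv that is causal along time: output frame t sees only
        input frames <= t. Spatial dims use ordinary symmetric padding.
        Illustrative sketch, not Wan-VAE's actual layer."""
        def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
            super().__init__()
            kt, kh, kw = kernel
            self.time_pad = kt - 1  # pad entirely on the past side
            self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                                  padding=(0, kh // 2, kw // 2))

        def forward(self, x):  # x: (batch, channels, time, height, width)
            # F.pad fills last dims first: (W_l, W_r, H_t, H_b, T_front, T_back)
            x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
            return self.conv(x)

    x = torch.randn(1, 3, 17, 64, 64)
    print(CausalConv3d(3, 16)(x).shape)  # torch.Size([1, 16, 17, 64, 64])

Because later frames never influence earlier latents, the encoder can stream arbitrarily long inputs without recomputing the past, which is how Wan-VAE handles unlimited-length 1080P video.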

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies with pip install -r requirements.txt; torch >= 2.4.0 is required.
  • Models: Download checkpoints from Hugging Face or ModelScope.
  • Inference: Run the generation scripts (e.g., python generate.py --task t2v-14B ...); a consolidated example follows this list.
  • Dependencies: Python, PyTorch, and the Hugging Face libraries. The 1.3B model runs on consumer GPUs with as little as 8.19 GB of VRAM.
  • Resources: Multi-GPU inference is supported via FSDP and xDiT USP.
  • Demos: Gradio demos are available. See Diffusers integration and ComfyUI integration.
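A consolidated quick-start combining the steps above into one session; the checkpoint repo id and the exact generate.py flags follow the project's documented pattern but may differ by release, so treat them as illustrative:

    # Clone and install (torch >= 2.4.0 required)
    git clone https://github.com/Wan-Video/Wan2.1.git
    cd Wan2.1
    pip install -r requirements.txt

    # Fetch a checkpoint (Hugging Face repo id assumed; check the model card)
    huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B

    # Single-GPU text-to-video on a consumer card (480P for the 1.3B model)
    python generate.py --task t2v-1.3B --size 832*480 \
        --ckpt_dir ./Wan2.1-T2V-1.3B \
        --prompt "Two anthropomorphic cats boxing on a spotlit stage"

    # Multi-GPU inference via FSDP + xDiT USP (flag names assumed)
    torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 \
        --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 \
        --prompt "Two anthropomorphic cats boxing on a spotlit stage"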

Highlighted Details

  • Supports Text-to-Video (T2V), Image-to-Video (I2V), First-Last-Frame-to-Video (FLF2V), and Text-to-Image (T2I).
  • The T2V-1.3B model requires only 8.19 GB of VRAM, generating 480P video in ~4 minutes on an RTX 4090 (see the Diffusers sketch after this list).
  • Capable of generating both Chinese and English text within videos.
  • Wan-VAE can encode/decode unlimited-length 1080P videos while preserving temporal information.
  • Offers prompt extension capabilities using Dashscope API or local Qwen models.
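For the Diffusers route mentioned under Quick Start, a minimal text-to-video sketch assuming the upstream WanPipeline integration and the Diffusers-format 1.3B checkpoint (model id and defaults may change between releases):

    import torch
    from diffusers import AutoencoderKLWan, WanPipeline
    from diffusers.utils import export_to_video

    model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # assumed repo id
    # Load the Wan VAE in fp32 for decode quality; the DiT in bf16.
    vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae",
                                           torch_dtype=torch.float32)
    pipe = WanPipeline.from_pretrained(model_id, vae=vae,
                                       torch_dtype=torch.bfloat16).to("cuda")

    frames = pipe(
        prompt="A cat walking on grass, realistic style",
        negative_prompt="low quality, blurry, distorted",
        height=480, width=832,   # 480P, the 1.3B model's stable resolution
        num_frames=81,
        guidance_scale=5.0,
    ).frames[0]
    export_to_video(frames, "t2v_480p.mp4", fps=15)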

Maintenance & Community

  • Active development with recent updates in April 2025.
  • Integrations with Diffusers and ComfyUI.
  • Community works and extensions are highlighted.
  • Community channels: Discord, WeChat groups.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Generated content is free to use, with restrictions against illegal, harmful, or misleading content.

Limitations & Caveats

  • The 1.3B T2V model's 720P generation is less stable than 480P.
  • FLF2V and I2V models are primarily trained on Chinese text-video pairs, so Chinese prompts are recommended for best results.
  • Diffusers integration for prompt extension and multi-GPU inference is noted as upcoming.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 13
  • Star History: 411 stars in the last 30 days

