HunyuanVideo is an open-source framework for large-scale video generation, aiming to match or exceed closed-source model performance. It targets researchers and developers in AI video generation, offering a robust foundation for creating high-quality, diverse, and text-aligned video content.
How It Works
HunyuanVideo employs a unified architecture for image and video generation using a Transformer with Full Attention. It utilizes a "Dual-stream to Single-stream" approach, processing modalities separately before fusing them. A key innovation is the use of a Decoder-Only MLLM as a text encoder, offering improved image-text alignment and detail description over traditional CLIP or T5 encoders. Video compression is handled by a 3D VAE with CausalConv3D, reducing token count for efficient diffusion transformer processing.
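As a rough illustration of the dual-stream-to-single-stream idea, here is a minimal NumPy sketch. Everything in it is a simplification (a single attention head, no learned projections, random tokens): the real model uses many layers of multi-head full attention over latents produced by the 3D VAE and text embeddings from the MLLM encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(x, d):
    # Single-head self-attention without learned weights, for illustration only.
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

d = 64
video_tokens = np.random.randn(32, d)  # stand-in for 3D-VAE video latents
text_tokens = np.random.randn(8, d)    # stand-in for MLLM text embeddings

# Dual-stream phase: each modality is refined independently.
video_tokens = full_attention(video_tokens, d)
text_tokens = full_attention(text_tokens, d)

# Single-stream phase: concatenate and attend jointly, so video tokens
# can condition on text tokens and vice versa.
fused = full_attention(np.concatenate([video_tokens, text_tokens], axis=0), d)
print(fused.shape)  # (40, 64)
```

The separate-then-fused structure lets each modality build its own representation before cross-modal interaction, which is the intuition behind the "Dual-stream to Single-stream" design.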
Quick Start & Requirements
- Install: Clone the repo, create a conda environment, install PyTorch (CUDA 11.8 or 12.4), flash-attention, and the remaining dependencies via requirements.txt.
- Prerequisites: NVIDIA GPU with CUDA 11.8/12.4+, Python 3.10.9.
- Hardware: minimum 45GB GPU memory for 544x960 video at 129 frames, 60GB for 720x1280 at 129 frames; 80GB recommended. Linux OS.
- Links: Project Page, Paper, Diffusers Integration.
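The setup steps above might look like the following in practice. The repository URL, Python pin, CUDA toolkit version, and flash-attention tag are assumptions drawn from the project's README conventions; adjust them to your environment.

```shell
# Clone the repository and create an isolated environment (Python 3.10.9).
git clone https://github.com/Tencent/HunyuanVideo
cd HunyuanVideo
conda create -n HunyuanVideo python==3.10.9 -y
conda activate HunyuanVideo

# Install PyTorch for your CUDA toolkit (11.8 or 12.4), then the remaining deps.
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y
python -m pip install -r requirements.txt

# flash-attention accelerates the full-attention Transformer; ninja speeds up its build.
# The v2.6.3 tag is an assumption; check the README for the currently pinned version.
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
```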
Highlighted Details
- Outperforms leading closed-source models in human evaluations, particularly in motion quality.
- Offers FP8 quantized weights for reduced GPU memory usage.
- Supports parallel inference via xDiT for multi-GPU acceleration.
- Includes a prompt rewrite module for enhanced text-to-video alignment.
- Released an Image-to-Video (I2V) model based on the same framework.
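Putting these highlights together, single-GPU sampling and multi-GPU xDiT inference might look roughly like this. The script name and flags (`--use-fp8`, `--use-cpu-offload`, `--ulysses-degree`, `--ring-degree`, and the rest) are assumptions about the project's CLI and should be verified against the current README.

```shell
cd HunyuanVideo

# Single-GPU sampling with FP8 weights and CPU offload to reduce GPU memory use.
python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --use-fp8 \
    --use-cpu-offload \
    --flow-reverse \
    --save-path ./results

# Multi-GPU parallel inference via xDiT (8 GPUs, Ulysses sequence parallelism).
torchrun --nproc_per_node=8 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --prompt "A cat walks on the grass, realistic style." \
    --ulysses-degree 8 \
    --ring-degree 1 \
    --save-path ./results
```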
Maintenance & Community
- Active development with recent releases including FP8 weights and Diffusers integration.
- Community contributions are highlighted, including ComfyUI wrappers and optimization projects.
- Links to WeChat and Discord are available for community engagement.
Licensing & Compatibility
- The repository itself is not explicitly licensed in the README. Model weights are available on Hugging Face.
- Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
- The README notes that current releases use a "fast version" of the model, whereas benchmark evaluations used a "high-quality version", implying quality trade-offs in the released weights.
- Installation can be complex, with specific CUDA and PyTorch version requirements and potential floating-point exceptions that require troubleshooting.