HunyuanVideo by Tencent-Hunyuan

PyTorch code for video generation research

created 8 months ago
10,781 stars

Top 4.8% on sourcepulse

Project Summary

HunyuanVideo is an open-source framework for large-scale video generation, aiming to match or exceed closed-source model performance. It targets researchers and developers in AI video generation, offering a robust foundation for creating high-quality, diverse, and text-aligned video content.

How It Works

HunyuanVideo employs a unified architecture for image and video generation using a Transformer with Full Attention. It utilizes a "Dual-stream to Single-stream" approach, processing modalities separately before fusing them. A key innovation is the use of a Decoder-Only MLLM as a text encoder, offering improved image-text alignment and detail description over traditional CLIP or T5 encoders. Video compression is handled by a 3D VAE with CausalConv3D, reducing token count for efficient diffusion transformer processing.
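
As a rough illustration of the CausalConv3D idea (not the repository's implementation; the class name and kernel defaults here are assumptions), temporal padding is applied only on the past side of the time axis, so each output frame depends on the current and earlier frames:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Illustrative causal 3D convolution: pad the time axis on the past side
    only, so frame t never sees frames later than t. Spatial dims use the
    usual symmetric 'same' padding."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                              # past frames only
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

video = torch.randn(1, 3, 9, 64, 64)                        # batch, channels, frames, H, W
print(CausalConv3d(3, 16)(video).shape)                     # torch.Size([1, 16, 9, 64, 64])
```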

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment, install PyTorch (CUDA 11.8 or 12.4), flash-attention, and other dependencies via requirements.txt.
  • Prerequisites: NVIDIA GPU with CUDA 11.8/12.4+, Python 3.10.9.
  • Hardware: Minimum 45 GB GPU memory for 544x960 video at 129 frames, 60 GB for 720x1280 at 129 frames; 80 GB recommended. Linux OS.
  • Links: Project Page, Paper, Diffusers Integration (a minimal inference sketch via Diffusers follows this list).
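
For reference, the Diffusers integration can be exercised with a short script along the lines of the sketch below. The model repository id (`hunyuanvideo-community/HunyuanVideo`) and the conservative resolution/frame settings are assumptions chosen to keep memory low; consult the Diffusers documentation for the recommended configuration.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Assumed community model id; verify against the Diffusers docs / Hugging Face.
model_id = "hunyuanvideo-community/HunyuanVideo"

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()          # decode the latent video in tiles to save VRAM
pipe.enable_model_cpu_offload()   # keep only the active sub-model on the GPU

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "hunyuan_video.mp4", fps=15)
```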

Highlighted Details

  • Outperforms leading closed-source models in human evaluations, particularly in motion quality.
  • Offers FP8 quantized weights for reduced GPU memory usage (a rough sketch of the idea follows this list).
  • Supports parallel inference via xDiT for multi-GPU acceleration.
  • Includes a prompt rewrite module for enhanced text-to-video alignment.
  • Released an Image-to-Video (I2V) model based on the same framework.
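
The FP8 weights reduce memory by storing transformer weights in an 8-bit floating-point format and upcasting them on the fly. The sketch below only illustrates that general idea with PyTorch's `float8_e4m3fn` dtype and a hypothetical `quantize_linear_` helper; it is not the repository's quantization code, and the actual scaling scheme may differ.

```python
import torch
import torch.nn as nn

def quantize_linear_(layer: nn.Linear) -> None:
    """Hypothetical helper: store a Linear layer's weight in float8_e4m3fn
    with a per-tensor scale, and dequantize on the fly in forward()."""
    w = layer.weight.data
    scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    layer.weight_fp8 = (w / scale).to(torch.float8_e4m3fn)   # ~4x smaller than fp32
    layer.weight_scale = scale
    del layer.weight                                          # drop the full-precision copy

    def forward(x, layer=layer):
        w = layer.weight_fp8.to(x.dtype) * layer.weight_scale  # upcast per call
        return nn.functional.linear(x, w, layer.bias)

    layer.forward = forward

lin = nn.Linear(4096, 4096).eval()
quantize_linear_(lin)
y = lin(torch.randn(2, 4096))
print(y.shape)  # torch.Size([2, 4096])
```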

Maintenance & Community

  • Active development with recent releases including FP8 weights and Diffusers integration.
  • Community contributions are highlighted, including ComfyUI wrappers and optimization projects.
  • Links to WeChat and Discord are available for community engagement.

Licensing & Compatibility

  • The repository itself is not explicitly licensed in the README. Model weights are available on Hugging Face.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • The README mentions a "fast version" used for current releases, which differs from the "high-quality version" used in benchmark evaluations, implying potential quality trade-offs in the released model.
  • Installation can be complex, with specific CUDA and PyTorch version requirements and potential floating-point exceptions requiring troubleshooting.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 7
  • Star History: 990 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

  • CogVideo by zai-org — text-to-video generation models (CogVideoX, CogVideo). Top 0.4%, 12k stars, created 3 years ago, updated 1 month ago.