HunyuanVideo-1.5 by Tencent-Hunyuan

Lightweight, high-quality video generation model

Created 1 month ago
3,037 stars

Top 15.6% on SourcePulse

Project Summary

HunyuanVideo-1.5 is a lightweight, high-performance video generation model that delivers state-of-the-art quality at 8.3B parameters and is designed to run on consumer GPUs. It supports both text-to-video (T2V) and image-to-video (I2V) generation, lowering the barrier for developers and creators.

How It Works

The model features an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE. Its core innovation, Selective and Sliding Tile Attention (SSTA), prunes computations to accelerate inference. It incorporates meticulous data curation, glyph-aware text encoding, and a multi-stage progressive training strategy for enhanced motion coherence and visual quality.
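The tile-wise pruning idea behind SSTA can be illustrated with a toy sketch (hypothetical code, not from the HunyuanVideo-1.5 repository; the real kernel operates on 3D spatiotemporal tiles inside a fused GPU attention kernel). Here each query tile attends only to a sliding window of neighboring key tiles, so the fraction of tile pairs actually computed stays well below full attention:

```python
# Toy sketch of sliding-tile sparse attention masking (illustrative only;
# function names and parameters are hypothetical, not the model's API).

def sliding_tile_mask(num_tiles, window):
    """Boolean mask: query tile i attends only to key tiles j with |i - j| <= window."""
    return [
        [abs(i - j) <= window for j in range(num_tiles)]
        for i in range(num_tiles)
    ]

def density(mask):
    """Fraction of tile pairs kept, relative to full (dense) attention."""
    kept = sum(sum(row) for row in mask)
    total = len(mask) ** 2
    return kept / total

mask = sliding_tile_mask(num_tiles=16, window=2)
print(f"kept fraction: {density(mask):.3f}")  # prints "kept fraction: 0.289"
```

The kept fraction shrinks further as the sequence grows, which is the source of the speedup: attention cost scales with the number of tile pairs actually evaluated rather than with the full quadratic grid.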

Quick Start & Requirements

  • Installation: clone the repository, then pip install -r requirements.txt and pip install tencentcloud-sdk-python. Flash Attention, Flex-Block-Attention, and SageAttention are recommended for best performance.
  • Prerequisites: Linux, Python 3.10+, NVIDIA GPU (14GB+ VRAM recommended with offloading).
  • Running: launch with torchrun --nproc_per_node=<N> generate.py .... The pretrained model weights must be downloaded separately.
  • Links: GitHub: https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

Highlighted Details

  • Architecture: 8.3B DiT with 3D VAE, achieving significant compression.
  • SSTA: Selective and Sliding Tile Attention delivers a 1.87× speedup over FlashAttention-3 for 10 s 720p synthesis.
  • Enhancements: Includes an efficient few-step super-resolution network upscaling to 1080p.
  • Capabilities: Demonstrates strong instruction following, cinematic aesthetics, text rendering, and physics compliance via advanced prompt rewriting.
  • Performance: inference optimized via CFG distillation and sparse attention; results reported on 8 H800 GPUs.

Maintenance & Community

Community contributions are encouraged (e.g., ComfyUI plugins). WeChat and Discord channels are available. Acknowledges open-source contributions from Transformers, Diffusers, HuggingFace, and Qwen-VL.

Licensing & Compatibility

The license type is not specified in the provided README text.

Limitations & Caveats

Distillation and sparse attention models are noted as "coming soon," and Diffusers support is not yet implemented. The primary supported environment appears to be Linux.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 6
  • Star History: 1,265 stars in the last 30 days
