Tencent-Hunyuan: Lightweight, high-quality video generation model
Top 15.6% on SourcePulse
Summary
HunyuanVideo-1.5 is a lightweight, high-performance video generation model that delivers state-of-the-art quality at a compact 8.3B parameters, small enough to run on consumer GPUs. It supports both text-to-video (T2V) and image-to-video (I2V) generation, lowering the barrier to entry for developers and creators.
How It Works
The model pairs an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE. Its core innovation, Selective and Sliding Tile Attention (SSTA), prunes attention computation to accelerate inference. Training combines meticulous data curation, glyph-aware text encoding, and a multi-stage progressive strategy to improve motion coherence and visual quality.
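The summary does not spell out how SSTA prunes attention, but the name suggests each query tile attends to a sliding window of neighboring tiles plus a few selectively chosen tiles rather than the full sequence. The sketch below is a toy NumPy illustration of that general idea under those assumptions; the function name, the tile-scoring heuristic, and all parameters are hypothetical and not the actual SSTA algorithm.

```python
import numpy as np

def tile_pruned_attention(q, k, v, tile=4, window=1, keep=1):
    """Toy tile-sparse attention: each query tile attends only to a sliding
    window of neighboring tiles plus `keep` extra tiles selected by a cheap
    importance score. Illustrative only; not the real SSTA."""
    n, d = q.shape
    n_tiles = n // tile
    # Mean key per tile: a cheap proxy for ranking tile importance.
    k_tiles = k.reshape(n_tiles, tile, d).mean(axis=1)
    out = np.zeros_like(q)
    for t in range(n_tiles):
        # Sliding window: tiles adjacent to the current query tile.
        local = set(range(max(0, t - window), min(n_tiles, t + window + 1)))
        # Selective part: add the top-scoring tiles outside the window.
        scores = q[t * tile:(t + 1) * tile].mean(axis=0) @ k_tiles.T
        extra = [int(i) for i in np.argsort(-scores) if int(i) not in local]
        local.update(extra[:keep])
        idx = np.concatenate(
            [np.arange(s * tile, (s + 1) * tile) for s in sorted(local)])
        # Standard scaled dot-product attention over the kept tokens only.
        att = q[t * tile:(t + 1) * tile] @ k[idx].T / np.sqrt(d)
        w = np.exp(att - att.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[t * tile:(t + 1) * tile] = w @ v[idx]
    return out
```

With a large enough `window` every tile is kept and the result reduces to dense attention, which is a handy sanity check; shrinking the window trades accuracy for fewer attended tokens.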
Quick Start & Requirements
Install dependencies with `pip install -r requirements.txt` and `pip install tencentcloud-sdk-python`. Flash Attention, Flex-Block-Attention, and SageAttention are recommended for best performance. Run inference with `torchrun --nproc_per_node=<N> generate.py ...`. Pretrained models require a separate download. Repository: https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5
Maintenance & Community
Community contributions are encouraged (e.g., ComfyUI plugins). WeChat and Discord channels are available. Acknowledges open-source contributions from Transformers, Diffusers, HuggingFace, and Qwen-VL.
Licensing & Compatibility
The license type is not specified in the project's README.
Limitations & Caveats
Distillation and sparse-attention model variants are noted as "coming soon." Diffusers support is not yet implemented. The primary supported environment appears to be Linux.