Text-to-image diffusion transformer with Chinese understanding
Hunyuan-DiT is a powerful, open-source diffusion transformer for text-to-image generation, with fine-grained understanding of both English and Chinese prompts. It offers advanced features like multi-turn dialogue for iterative image refinement, and supports ControlNet and IP-Adapter for enhanced control.
How It Works
Hunyuan-DiT operates in latent space, leveraging a pre-trained VAE to compress images. The core diffusion model is a transformer, paired with a bilingual CLIP text encoder and a multilingual T5 encoder for robust language understanding. A key innovation is the integration of a Multimodal Large Language Model (MLLM), DialogGen, for processing multi-turn dialogues, enabling users to iteratively refine generated images through natural-language conversation.
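For orientation, here is a minimal generation sketch using the diffusers integration (HunyuanDiTPipeline); the hub repo id below is an assumption based on the project's published checkpoints and may change between releases.

```python
# Minimal text-to-image sketch via the diffusers integration
# (assumes diffusers >= 0.28; the hub repo id may change between releases).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
).to("cuda")

# Prompts can be English or Chinese; 渔舟唱晚 ~ "fishing boats singing at dusk".
image = pipe(prompt="渔舟唱晚", num_inference_steps=50).images[0]
image.save("hunyuan_dit_sample.png")
```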
Quick Start & Requirements
Set up a conda environment from environment.yml, then install the remaining Python requirements; Docker images for CUDA 11 and 12 are also provided. Pretrained checkpoints are downloaded with huggingface-cli (or the equivalent Python API, sketched below).
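As an illustrative alternative to the CLI, the same download can be scripted with the huggingface_hub Python API; the repo id and target directory here are assumptions, so check the project README for the current release.

```python
# Fetch the pretrained checkpoints into ./ckpts (repo id assumed;
# the project README lists the exact repo for each release, e.g. v1.2).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
    local_dir="./ckpts",
)
```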
Inference requires ~11GB of VRAM for the base model, and 4-bit quantization is available to reduce the DialogGen prompt-enhancement model's memory usage to ~22GB.
Maintenance & Community
The project is actively maintained by Tencent and has seen contributions from various community members, including integrations with ComfyUI and Kohya_ss.
Licensing & Compatibility
The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.
Limitations & Caveats
While 6GB VRAM inference is supported via diffusers and bitsandbytes, the full feature set and optimal performance, especially for training and multi-turn dialogue, benefit significantly from higher VRAM (32GB+ recommended).
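Below is a sketch of one low-memory configuration, assuming the diffusers HunyuanDiTPipeline and its standard offloading helpers; the project's actual 6GB recipe may differ (for example, by also quantizing text encoders with bitsandbytes).

```python
# Low-VRAM inference sketch (assumes diffusers >= 0.28 plus accelerate;
# the project's own 6GB recipe may differ from this configuration).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # keep only the active sub-module on the GPU
pipe.vae.enable_tiling()         # decode latents in tiles to cap VAE memory

image = pipe(prompt="a watercolor fox in a bamboo forest").images[0]
image.save("low_vram_sample.png")
```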