Text-to-image diffusion transformer with Chinese understanding
Hunyuan-DiT is a powerful, open-source diffusion transformer for text-to-image generation, with fine-grained understanding of both English and Chinese prompts. It offers advanced features like multi-turn dialogue for iterative image refinement, and supports ControlNet and IP-Adapter for enhanced control.
How It Works
Hunyuan-DiT operates in latent space, leveraging a pre-trained VAE to compress images. The core diffusion model is a transformer, paired with a bilingual CLIP text encoder and a multilingual T5 encoder for robust language understanding. A key innovation is the integration of a Multimodal Large Language Model (MLLM), DialogGen, for processing multi-turn dialogues, enabling users to iteratively refine generated images through natural-language conversation.
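For orientation, here is a minimal generation sketch using the diffusers integration (HunyuanDiTPipeline); the hub repo id below is an assumption based on the project's published checkpoints and may change between releases.

```python
# Minimal text-to-image sketch via the diffusers integration
# (assumes diffusers >= 0.28; the hub repo id may change between releases).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
).to("cuda")

# Prompts can be English or Chinese; 渔舟唱晚 ~ "fishing boats singing at dusk".
image = pipe(prompt="渔舟唱晚", num_inference_steps=50).images[0]
image.save("hunyuan_dit_sample.png")
```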
Quick Start & Requirements
Set up a conda environment from environment.yml, then install the remaining Python requirements; Docker images for CUDA 11 and 12 are also provided. Pretrained checkpoints are downloaded with huggingface-cli (or the equivalent Python API, sketched below).
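As an illustrative alternative to the CLI, the same download can be scripted with the huggingface_hub Python API; the repo id and target directory here are assumptions, so check the project README for the current release.

```python
# Fetch the pretrained checkpoints into ./ckpts (repo id assumed;
# the project README lists the exact repo for each release, e.g. v1.2).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Tencent-Hunyuan/HunyuanDiT-v1.2",
    local_dir="./ckpts",
)
```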
Inference requires ~11GB of VRAM for the base model, and 4-bit quantization is available to reduce the DialogGen prompt-enhancement model's memory usage to ~22GB.
Maintenance & Community
The project is actively maintained by Tencent and has seen contributions from various community members, including integrations with ComfyUI and Kohya_ss.
Licensing & Compatibility
The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.
Limitations & Caveats
While 6GB VRAM inference is supported via diffusers and bitsandbytes, the full feature set and optimal performance, especially for training and multi-turn dialogue, benefit significantly from higher VRAM (32GB+ recommended).
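Below is a sketch of one low-memory configuration, assuming the diffusers HunyuanDiTPipeline and its standard offloading helpers; the project's actual 6GB recipe may differ (for example, by also quantizing text encoders with bitsandbytes).

```python
# Low-VRAM inference sketch (assumes diffusers >= 0.28 plus accelerate;
# the project's own 6GB recipe may differ from this configuration).
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # keep only the active sub-module on the GPU
pipe.vae.enable_tiling()         # decode latents in tiles to cap VAE memory

image = pipe(prompt="a watercolor fox in a bamboo forest").images[0]
image.save("low_vram_sample.png")
```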