HunyuanDiT by Tencent-Hunyuan

Text-to-image diffusion transformer with Chinese understanding

created 1 year ago
4,214 stars

Top 11.8% on sourcepulse

Project Summary

Hunyuan-DiT is an open-source diffusion transformer for text-to-image generation that handles both English and Chinese prompts. It supports multi-turn dialogue for iterative image refinement, plus ControlNet and IP-Adapter for finer control over the output.

How It Works

Hunyuan-DiT operates in latent space: a pre-trained VAE compresses images into latents, and a transformer serves as the core diffusion model. Text conditioning comes from a bilingual CLIP encoder and a multilingual T5 encoder. A Multimodal Large Language Model (MLLM) handles multi-turn dialogue, so users can refine generated images iteratively through natural-language conversation.
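To make the latent-space idea concrete, the sketch below works through the shape arithmetic for one image. The numbers are assumptions that this summary does not state: an 8x spatial downsampling VAE with 4 latent channels and a transformer patch size of 2. Check them against the model configuration before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentConfig:
    # All three values are assumptions for illustration, not taken from this summary.
    vae_downsample: int = 8   # assumed spatial reduction factor of the VAE
    latent_channels: int = 4  # assumed number of latent channels
    patch_size: int = 2       # assumed transformer patch size


def latent_shape(height: int, width: int, cfg: LatentConfig = LatentConfig()):
    """Return (channels, height, width) of the latent for a given image size."""
    if height % cfg.vae_downsample or width % cfg.vae_downsample:
        raise ValueError("image size must be divisible by the VAE downsample factor")
    return (
        cfg.latent_channels,
        height // cfg.vae_downsample,
        width // cfg.vae_downsample,
    )


def token_count(height: int, width: int, cfg: LatentConfig = LatentConfig()) -> int:
    """Return how many patch tokens the transformer would see for one image."""
    _, h, w = latent_shape(height, width, cfg)
    if h % cfg.patch_size or w % cfg.patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (h // cfg.patch_size) * (w // cfg.patch_size)


print(latent_shape(1024, 1024))  # (4, 128, 128)
print(token_count(1024, 1024))   # 4096
```

Under these assumptions, a 1024x1024 RGB image (3,145,728 values) becomes a latent of 65,536 values, a 48x reduction, which is what makes running a transformer over the whole image tractable.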

Quick Start & Requirements

  • Installation: Clone the repository and set up a Conda environment using environment.yml, then install requirements. Docker images for CUDA 11/12 are also provided.
  • Prerequisites: NVIDIA GPU with CUDA support (minimum 11GB VRAM, 14GB recommended for RTX 3090/4090, 32GB for A100). CUDA 11.7 or 12.0+ recommended. Flash Attention v2 installation is optional but recommended for acceleration.
  • Resources: Download pre-trained models using huggingface-cli. Inference requires ~11GB VRAM for the base model; an optional 4-bit quantization mode brings DialogGen's memory requirement down to ~22GB.
  • Links: Project Page, Diffusers Integration, ComfyUI Integration
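For the Diffusers integration linked above, a minimal inference sketch might look like the following. It assumes a recent diffusers release that ships `HunyuanDiTPipeline`, a CUDA GPU, and the Hub model id `Tencent-Hunyuan/HunyuanDiT-Diffusers`; the model id and the helper names `build_pipeline` and `generate` are illustrative assumptions, so verify the id on the Hugging Face Hub before use.

```python
def build_pipeline(model_id="Tencent-Hunyuan/HunyuanDiT-Diffusers", device="cuda"):
    """Load the pipeline in half precision (model_id is an assumed Hub id)."""
    # Imported lazily so this module can be loaded without torch or diffusers.
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    return pipe.to(device)


def generate(prompt, output_path="output.png", **pipeline_kwargs):
    """Generate one image for the prompt and save it to output_path."""
    pipe = build_pipeline(**pipeline_kwargs)
    image = pipe(prompt=prompt).images[0]  # first (and only) generated image
    image.save(output_path)
    return output_path
```

Calling `generate("一个宇航员在骑马")` (or the English equivalent, "an astronaut riding a horse") would download the weights on first use and write `output.png`.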

Highlighted Details

  • Achieves state-of-the-art results in Chinese text-to-image generation, according to human evaluations.
  • Supports multi-turn text-to-image generation via a DialogGen model.
  • Offers acceleration via Distillation and TensorRT versions.
  • Integrates ControlNet (canny, depth, pose) and IP-Adapter.
  • Provides LoRA training and inference capabilities.
  • Includes Hunyuan-Captioner for fine-grained image captioning.

Maintenance & Community

The project is actively maintained by Tencent and has seen contributions from various community members, including integrations with ComfyUI and Kohya_ss.

Licensing & Compatibility

The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

Inference in 6GB of VRAM is supported via diffusers and bitsandbytes, but training and multi-turn dialogue need considerably more memory (32GB+ recommended).
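The memory figures quoted in this summary can be collected into a small lookup, as in the sketch below. The thresholds are the approximate numbers given here (6GB, 11GB, 22GB, 32GB); the mode names are hypothetical labels, not options of the Hunyuan-DiT scripts.

```python
# Approximate VRAM floors (GB) taken from the figures quoted in this summary.
# The mode names are hypothetical labels chosen for illustration.
VRAM_FLOORS_GB = {
    "diffusers + bitsandbytes (low-memory inference)": 6,
    "base model inference": 11,
    "DialogGen with 4-bit quantization": 22,
    "training / full multi-turn dialogue": 32,
}


def feasible_modes(vram_gb: float) -> list[str]:
    """Return every mode whose approximate floor fits in the given VRAM."""
    return [name for name, floor in VRAM_FLOORS_GB.items() if vram_gb >= floor]


for mode in feasible_modes(24):
    print(mode)
```

On a 24GB card this lists the first three modes and leaves out the 32GB tier.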

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

158 stars in the last 90 days
