HunyuanDiT by Tencent-Hunyuan

Text-to-image diffusion transformer with Chinese understanding

Created 1 year ago
4,294 stars

Top 11.3% on SourcePulse

Project Summary

Hunyuan-DiT is an open-source diffusion transformer for text-to-image generation that handles both English and Chinese prompts. It supports multi-turn dialogue for iterative image refinement, and it works with ControlNet and IP-Adapter for finer control over the output.

How It Works

Hunyuan-DiT operates in the latent space, leveraging a pre-trained VAE to compress images. The core diffusion model is a transformer architecture, enhanced by a bilingual CLIP and multilingual T5 text encoder for robust language understanding. A key innovation is the integration of a Multimodal Large Language Model (MLLM) for processing multi-turn dialogues, enabling users to iteratively refine generated images through natural language conversations.
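To make the latent-space idea concrete, the sketch below computes how many tokens the transformer would see for a given image size. The 8x VAE downsampling factor and the patch size of 2 are assumptions (values commonly reported for this family of models, not stated in this summary), and `latent_layout` is a hypothetical helper, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentLayout:
    latent_height: int
    latent_width: int
    tokens: int


def latent_layout(height: int, width: int,
                  vae_factor: int = 8, patch_size: int = 2) -> LatentLayout:
    """Estimate the latent grid and token count for an image.

    vae_factor and patch_size are assumed defaults, not values taken
    from the project summary.
    """
    stride = vae_factor * patch_size
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    # The VAE shrinks each spatial dimension by vae_factor.
    latent_h, latent_w = height // vae_factor, width // vae_factor
    # The transformer groups the latent grid into patch_size x patch_size tokens.
    tokens = (latent_h // patch_size) * (latent_w // patch_size)
    return LatentLayout(latent_h, latent_w, tokens)
```

Under these assumptions a 1024x1024 image becomes a 128x128 latent grid and 4096 tokens, which is why attention cost grows quickly with resolution.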

Quick Start & Requirements

  • Installation: Clone the repository and set up a Conda environment using environment.yml, then install requirements. Docker images for CUDA 11/12 are also provided.
  • Prerequisites: NVIDIA GPU with CUDA support. Base-model inference needs at least 11GB of VRAM (14GB recommended on RTX 3090/4090); multi-turn dialogue with DialogGen needs about 32GB (e.g. an A100). CUDA 11.7 or 12.0+ is recommended. Flash Attention v2 is optional but recommended for acceleration.
  • Resources: Download pre-trained models using huggingface-cli. Inference requires ~11GB VRAM for the base model; loading DialogGen with 4-bit quantization lowers peak memory for the DialogGen + Hunyuan-DiT pipeline to ~22GB.
  • Links: Project Page, Diffusers Integration, ComfyUI Integration
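As a rough illustration of the Diffusers integration listed above, the sketch below separates argument building from model loading so the first part can be checked without a GPU. `HunyuanDiTPipeline` is the pipeline class in the diffusers library; the model id, the default step count and the guidance scale are assumptions chosen for illustration, and `build_generation_kwargs` and `generate` are hypothetical helpers. Running `generate` requires torch, diffusers, a CUDA GPU and a model download.

```python
def build_generation_kwargs(prompt, negative_prompt="", steps=50,
                            guidance_scale=5.0, height=1024, width=1024):
    """Collect keyword arguments for the pipeline call (values are illustrative)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "height": height,
        "width": width,
    }


def generate(prompt, model_id="Tencent-Hunyuan/HunyuanDiT-Diffusers",
             device="cuda", **overrides):
    """Load the pipeline and return one PIL image.

    model_id is an assumed Hugging Face repository name; check the
    project page for the current one.
    """
    # Heavy imports are deferred so the helper above works without a GPU stack.
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.to(device)
    return pipe(**build_generation_kwargs(prompt, **overrides)).images[0]


# Example (needs a GPU and downloaded weights):
# image = generate("一只戴着宇航员头盔的猫", steps=30)
# image.save("cat.png")
```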

Highlighted Details

  • Reports state-of-the-art results among open-source models for Chinese text-to-image generation, based on human evaluation.
  • Supports multi-turn text-to-image generation via a DialogGen model.
  • Offers acceleration via Distillation and TensorRT versions.
  • Integrates ControlNet (canny, depth, pose) and IP-Adapter.
  • Provides LoRA training and inference capabilities.
  • Includes Hunyuan-Captioner for fine-grained image captioning.

Maintenance & Community

The project is maintained by Tencent and has drawn community contributions, including integrations with ComfyUI and Kohya_ss.

Licensing & Compatibility

The repository is released under the Tencent Hunyuan Community License Agreement rather than a standard open-source license such as Apache 2.0; review its terms before commercial use or linking with closed-source projects.

Limitations & Caveats

While 6GB VRAM inference is supported via diffusers and bitsandbytes, the full feature set and optimal performance, especially for training and multi-turn dialogue, benefit significantly from higher VRAM (32GB+ recommended).
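The VRAM figures quoted in this summary can be turned into a simple lookup. The sketch below is only one reading of those numbers: the tier labels and the `pick_mode` helper are hypothetical, and actual memory use depends on resolution, batch size and library versions.

```python
from typing import Optional

# (minimum VRAM in GB, mode), ordered from most to least demanding.
# Thresholds come from the figures quoted in this summary; labels are illustrative.
VRAM_TIERS = [
    (32.0, "multi-turn dialogue: DialogGen + Hunyuan-DiT, no quantization"),
    (22.0, "multi-turn dialogue: 4-bit DialogGen + Hunyuan-DiT"),
    (11.0, "base Hunyuan-DiT inference"),
    (6.0, "low-memory inference via diffusers + bitsandbytes"),
]


def pick_mode(vram_gb: float) -> Optional[str]:
    """Return the most capable mode that fits in vram_gb, or None if none fits."""
    for minimum, mode in VRAM_TIERS:
        if vram_gb >= minimum:
            return mode
    return None
```

For example, a 24GB card maps to the 4-bit DialogGen tier, while anything under 6GB returns `None`.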

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 30 days
