HunyuanDiT by Tencent-Hunyuan

Text-to-image diffusion transformer with Chinese understanding

Created 1 year ago
4,240 stars

Top 11.6% on SourcePulse

Project Summary

Hunyuan-DiT is an open-source diffusion transformer for text-to-image generation that handles both English and Chinese prompts. It supports multi-turn dialogue for iterative image refinement, and works with ControlNet and IP-Adapter for finer control over the output.

How It Works

Hunyuan-DiT operates in latent space: a pre-trained VAE compresses images into latents, and the diffusion model, a transformer, denoises those latents. Text conditioning comes from two encoders, a bilingual CLIP and a multilingual T5. A Multimodal Large Language Model (MLLM) handles multi-turn dialogue, so users can refine a generated image step by step through natural-language conversation.
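To make the latent-space idea concrete, the sketch below counts how many tokens a transformer would process for a given image size. The 8x VAE downsampling factor and the 2x2 patch size are assumptions typical of latent diffusion transformers, not figures stated in this summary, and `latent_token_count` is a hypothetical helper.

```python
def latent_token_count(height, width, vae_factor=8, patch_size=2):
    """Return (latent_h, latent_w, num_tokens) for an image of the given size.

    vae_factor and patch_size are assumed defaults, not values from the summary.
    """
    step = vae_factor * patch_size
    if height % step or width % step:
        raise ValueError("image sides must be divisible by vae_factor * patch_size")
    # The VAE shrinks each spatial side by vae_factor.
    latent_h, latent_w = height // vae_factor, width // vae_factor
    # The transformer then groups the latent grid into patch_size x patch_size tokens.
    tokens = (latent_h // patch_size) * (latent_w // patch_size)
    return latent_h, latent_w, tokens


print(latent_token_count(1024, 1024))  # (128, 128, 4096)
```

Under these assumptions the sequence length grows with image area, which is one reason higher resolutions need noticeably more memory.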

Quick Start & Requirements

  • Installation: Clone the repository and set up a Conda environment using environment.yml, then install requirements. Docker images for CUDA 11/12 are also provided.
  • Prerequisites: NVIDIA GPU with CUDA support (minimum 11GB VRAM, 14GB recommended for RTX 3090/4090, 32GB for A100). CUDA 11.7 or 12.0+ recommended. Flash Attention v2 installation is optional but recommended for acceleration.
  • Resources: Download pre-trained models using huggingface-cli. Inference needs ~11GB VRAM for the base model; optional 4-bit quantization brings DialogGen's memory use down to ~22GB.
  • Links: Project Page, Diffusers Integration, ComfyUI Integration
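For the Diffusers route mentioned in the links, a minimal inference sketch might look like the following. It assumes a diffusers release that ships `HunyuanDiTPipeline`, a CUDA GPU, and the checkpoint ID shown as the default; none of these are stated in this summary, and the step count and guidance scale are illustrative. `build_generation_kwargs` and `generate` are hypothetical helper names.

```python
def build_generation_kwargs(prompt, steps=50, guidance_scale=5.0):
    """Collect call arguments for the pipeline; the defaults are illustrative."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt, model_id="Tencent-Hunyuan/HunyuanDiT-Diffusers", seed=0):
    """Run one text-to-image generation on a CUDA GPU and return a PIL image."""
    # Imported lazily so the helper above works without torch or diffusers installed.
    import torch
    from diffusers import HunyuanDiTPipeline

    # Assumed checkpoint ID; half precision keeps memory use down.
    pipe = HunyuanDiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.to("cuda")
    # A seeded generator makes the result reproducible.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(generator=generator, **build_generation_kwargs(prompt)).images[0]


# Example (requires a CUDA GPU and downloads the checkpoint).
# The Chinese prompt means "an astronaut riding a horse":
# generate("一个宇航员在骑马").save("astronaut.png")
```

Keeping the argument builder separate from the GPU call makes it easy to check settings without loading the model.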

Highlighted Details

  • State-of-the-art performance in Chinese text-to-image generation based on human evaluations.
  • Supports multi-turn text-to-image generation via a DialogGen model.
  • Offers acceleration via Distillation and TensorRT versions.
  • Integrates ControlNet (canny, depth, pose) and IP-Adapter.
  • Provides LoRA training and inference capabilities.
  • Includes Hunyuan-Captioner for fine-grained image captioning.

Maintenance & Community

The project is maintained by Tencent and has received community contributions, including integrations with ComfyUI and Kohya_ss.

Licensing & Compatibility

The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

Inference in 6GB of VRAM is supported via diffusers and bitsandbytes, but the full feature set, especially training and multi-turn dialogue, benefits significantly from more memory (32GB+ recommended).
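The memory figures quoted in this summary (6, 11, 22, and 32 GB) can be read as tiers. The sketch below maps available VRAM to the richest mode that would fit under that reading; the mode labels and the `pick_inference_mode` helper are hypothetical, and real memory use will vary with resolution and batch size.

```python
def pick_inference_mode(vram_gb):
    """Map available GPU memory (GB) to the richest mode the summary suggests fits."""
    # Thresholds come from the figures quoted in the summary; labels are illustrative.
    tiers = [
        (32, "multi-turn dialogue without quantization"),
        (22, "multi-turn dialogue with 4-bit DialogGen"),
        (11, "Hunyuan-DiT base model"),
        (6, "Hunyuan-DiT via diffusers + bitsandbytes"),
    ]
    # Tiers are ordered from largest to smallest, so the first match is the best fit.
    for minimum, mode in tiers:
        if vram_gb >= minimum:
            return mode
    return "below documented minimum"


print(pick_inference_mode(24))  # multi-turn dialogue with 4-bit DialogGen
```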

Health Check

  • Last commit: 8 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 30 days
