HunyuanImage-2.1 by Tencent-Hunyuan

High-resolution 2K text-to-image generation

Created 5 months ago

671 stars

Top 50.2% on SourcePulse

Project Summary

HunyuanImage-2.1 is a text-to-image model that generates high-resolution (2K) images, with an emphasis on text-image alignment and inference efficiency. Aimed at researchers and power users, it accepts multilingual prompts and includes a prompt enhancement module for producing detailed, semantically accurate images.

How It Works

The model uses a two-stage diffusion transformer architecture: a 17-billion-parameter base model followed by a refiner. Key components include a high-compression VAE (32x) aligned with DINOv2 for efficient 2K image generation, dual text encoders (an MLLM and a multilingual ByT5) for improved semantic understanding and text rendering, and Reinforcement Learning from Human Feedback (RLHF) for aesthetic refinement. Meanflow distillation is used for faster, high-quality sampling.
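To see why the 32x VAE matters for 2K generation, the following sketch counts latent positions for a given image size and compression factor. It assumes "2K" means 2048x2048 pixels and uses an 8x VAE (common in latent diffusion models) purely as a comparison point; the function name is hypothetical and any patchification inside the transformer is ignored.

```python
def latent_positions(width: int, height: int, vae_factor: int) -> int:
    """Number of spatial positions in the latent grid after VAE downsampling."""
    if width % vae_factor or height % vae_factor:
        raise ValueError("image sides must be multiples of the VAE factor")
    return (width // vae_factor) * (height // vae_factor)


# Assumption: a 2K image is 2048x2048 pixels.
positions_32x = latent_positions(2048, 2048, vae_factor=32)  # 64 * 64
positions_8x = latent_positions(2048, 2048, vae_factor=8)    # 256 * 256

print(positions_32x, positions_8x, positions_8x // positions_32x)
```

Under these assumptions the 32x VAE yields a 64x64 latent grid, 16 times fewer positions than an 8x VAE would produce for the same image, which is where the efficiency gain comes from.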

Quick Start & Requirements

To install, run the following commands in order:
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-2.1.git
cd HunyuanImage-2.1
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation

Requirements: Linux, an NVIDIA GPU with CUDA support, and at least 36GB of GPU memory (with CPU offloading enabled). An FP8-quantized model with lower memory usage is planned. Official repository: https://github.com/Tencent-Hunyuan/HunyuanImage-2.1
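Before installing, it can help to check a machine against the stated requirements. The sketch below is a hypothetical preflight helper, not part of the repository: it uses only the standard library, takes the GPU memory figure as an argument rather than querying the driver, and treats the 36GB figure from this summary as the threshold.

```python
import platform
from typing import List, Optional

MIN_GPU_MEMORY_GB = 36  # minimum stated in this summary (with CPU offloading)


def preflight(gpu_memory_gb: float, system: Optional[str] = None) -> List[str]:
    """Return a list of unmet requirements; an empty list means all checks pass."""
    system = system or platform.system()
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux required)")
    if gpu_memory_gb < MIN_GPU_MEMORY_GB:
        problems.append(
            f"GPU memory {gpu_memory_gb}GB is below the {MIN_GPU_MEMORY_GB}GB minimum"
        )
    return problems


# Example: a 24GB GPU on the current machine.
for problem in preflight(gpu_memory_gb=24):
    print(problem)
```

Passing the OS name and memory size explicitly keeps the check testable; in practice the memory value would come from a tool such as nvidia-smi.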

Highlighted Details

Generates ultra-high-definition (2K) images with cinematic composition.
Supports both Chinese and English prompts natively, with glyph-aware text rendering via ByT5.
Offers flexible aspect ratio support (1:1, 16:9, 9:16, etc.).
Features an automatic prompt rewriting module (PromptEnhancer) for improved descriptive accuracy.
Reports the best semantic alignment among open-source models in SSAE evaluation, and results comparable to closed-source alternatives in GSB evaluation.
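The aspect ratio support listed above implies choosing a width and height near 2K for each ratio. The sketch below shows one plausible way to do that: keep the pixel area close to 2048x2048 and snap each side to a multiple of 32 so it divides evenly under the 32x VAE. The function and its rounding rule are assumptions for illustration, not the project's official resolution presets; consult the repository for the sizes the model actually supports.

```python
import math
from typing import Tuple


def dims_for_aspect(ratio_w: int, ratio_h: int,
                    target_side: int = 2048, multiple: int = 32) -> Tuple[int, int]:
    """Pick (width, height) with roughly target_side**2 pixels at the given ratio."""
    target_area = target_side * target_side
    height = math.sqrt(target_area * ratio_h / ratio_w)
    width = height * ratio_w / ratio_h

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for ratio in [(1, 1), (16, 9), (9, 16)]:
    print(ratio, dims_for_aspect(*ratio))
```

Snapping to a multiple of the VAE factor is the design choice that matters here: it guarantees an integer latent grid regardless of the requested ratio.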

Maintenance & Community

The project acknowledges contributions from Qwen, FLUX, diffusers, and HuggingFace. No specific community channels (e.g., Discord, Slack) or detailed roadmap information were present in the provided text. The release date for inference code and weights is noted as September 8, 2025.

Licensing & Compatibility

No specific license information was provided in the README excerpt.

Limitations & Caveats

The model is restricted to Linux environments and exclusively supports 2K resolution generation; lower resolutions produce artifacts. It demands substantial GPU memory (36GB minimum), although an FP8 version is planned to mitigate this.

Health Check

Last Commit

4 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days