HunyuanImage-3.0 by Tencent-Hunyuan

Native multimodal model for advanced image generation

Created 2 weeks ago


2,202 stars

Top 20.5% on SourcePulse

View on GitHub
Project Summary

HunyuanImage-3.0 is a native multimodal model for image generation, addressing the need for high-fidelity, contextually rich visual output. It targets researchers and developers seeking state-of-the-art text-to-image capabilities, with performance comparable to or exceeding leading closed-source models thanks to an advanced autoregressive framework.

How It Works

This project employs a unified autoregressive framework, diverging from typical DiT architectures, to directly model text and image modalities. It features the largest open-source Mixture of Experts (MoE) model to date, comprising 64 experts and 80 billion total parameters (13 billion active per token). This design enables intelligent world-knowledge reasoning, allowing the model to automatically elaborate on sparse prompts with contextually relevant details for superior image generation.
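The MoE sizing quoted above can be sanity-checked with some back-of-envelope arithmetic. The sketch below uses only the figures from the paragraph (80B total parameters, 13B active per token, 64 experts); the per-expert figure is an upper bound because it ignores shared (non-expert) weights such as attention and embeddings.

```python
# Back-of-envelope MoE arithmetic for HunyuanImage-3.0,
# using only the figures quoted in the summary above.
total_params = 80e9    # 80 billion total parameters
active_params = 13e9   # 13 billion activated per token
num_experts = 64

# Fraction of the network that fires for any single token.
active_fraction = active_params / total_params  # 0.1625, i.e. ~16%

# Rough per-expert share if ALL weights were expert-partitioned;
# real experts are smaller since attention/embedding weights are shared.
per_expert_upper_bound = total_params / num_experts  # 1.25B

print(f"active fraction: {active_fraction:.2%}")
print(f"per-expert upper bound: {per_expert_upper_bound / 1e9:.2f}B params")
```

This sparsity is what makes an 80B-parameter model tractable at inference time: each token only pays the compute cost of the ~13B active parameters, which is consistent with the multi-GPU VRAM requirement listed under Quick Start.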

Quick Start & Requirements

  • Requirements: Linux, NVIDIA GPU (CUDA 12.8), Python 3.11+, PyTorch 2.7.1. Requires 170GB disk space and ≥3x80GB VRAM (4x80GB recommended).
  • Installation: Install PyTorch (cu128), tencentcloud-sdk-python, and the dependencies in requirements.txt. Optional optimizations: FlashAttention and FlashInfer, for up to 3x faster inference.
  • Usage: Download weights from HuggingFace (hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3). Run via Transformers library or run_image_gen.py. Interactive Gradio demo available.
  • Links: Official website (implied), HuggingFace, GitHub.
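The steps above can be assembled into a quick-start sketch. The download command and version numbers come from the summary; the exact pip package names for the optional optimizations and the run_image_gen.py flags are assumptions, so check the repository README before running.

```shell
# Quick-start sketch, assuming a Linux host with an NVIDIA GPU (CUDA 12.8),
# Python 3.11+, ~170GB free disk, and >=3x80GB of VRAM.

# 1. Install PyTorch 2.7.1 built against CUDA 12.8.
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128

# 2. Install the SDK and the project's dependencies.
pip install tencentcloud-sdk-python
pip install -r requirements.txt

# 3. Optional: FlashAttention / FlashInfer for up to ~3x faster inference
#    (package names are assumptions; see the README for the supported install).
pip install flash-attn
pip install flashinfer

# 4. Download the weights to a local directory. The dot in the repo name
#    prevents loading it directly by hub name with Transformers, so a
#    dot-free local directory is used.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3

# 5. Generate an image (flags are illustrative; run the script with
#    --help for the actual interface, or use the Gradio demo).
python run_image_gen.py --prompt "a watercolor fox in a misty forest"
```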

Highlighted Details

  • Features the largest open-source image generation MoE model (80B total parameters, 64 experts).
  • Unified autoregressive architecture for integrated multimodal understanding and generation.
  • Demonstrates superior image generation performance with exceptional prompt adherence and photorealism.
  • Incorporates intelligent world-knowledge reasoning for automatic prompt elaboration.

Maintenance & Community

The project welcomes community contributions and mentions WeChat and Discord channels, though direct links are not provided in the README. Key components like inference code and checkpoints are open-sourced, with plans for Instruct Checkpoints, VLLM support, and Image-to-Image generation.

Licensing & Compatibility

The specific open-source license is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The base pre-trained checkpoint requires external prompt enhancement (e.g., via DeepSeek). Because the repo name tencent/HunyuanImage-3.0 contains a dot, the weights must be downloaded to a local directory (renamed without the dot) before loading with Transformers. Instruct Checkpoints, VLLM support, and Image-to-Image generation are not yet open-sourced.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
36
Star History
2,215 stars in the last 19 days

