HunyuanImage-3.0 by Tencent-Hunyuan

Native multimodal model for advanced image generation

Created 2 weeks ago


2,202 stars

Top 20.5% on SourcePulse

View on GitHub
Project Summary

HunyuanImage-3.0 is a native multimodal model for image generation, addressing the need for high-fidelity, contextually rich visual output. It targets researchers and developers seeking state-of-the-art text-to-image capabilities, with performance comparable to or exceeding leading closed-source models thanks to an advanced autoregressive framework.

How It Works

This project employs a unified autoregressive framework, diverging from typical DiT architectures, to directly model text and image modalities. It features the largest open-source Mixture of Experts (MoE) model to date, comprising 64 experts and 80 billion total parameters (13 billion active per token). This design enables intelligent world-knowledge reasoning, allowing the model to automatically elaborate on sparse prompts with contextually relevant details for superior image generation.
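The MoE sizing quoted above can be sanity-checked with some back-of-envelope arithmetic. The sketch below uses only the figures from the paragraph (80B total parameters, 13B active per token, 64 experts); the per-expert figure is an upper bound because it ignores shared (non-expert) weights such as attention and embeddings.

```python
# Back-of-envelope MoE arithmetic for HunyuanImage-3.0,
# using only the figures quoted in the summary above.
total_params = 80e9    # 80 billion total parameters
active_params = 13e9   # 13 billion activated per token
num_experts = 64

# Fraction of the network that fires for any single token.
active_fraction = active_params / total_params  # 0.1625, i.e. ~16%

# Rough per-expert share if ALL weights were expert-partitioned;
# real experts are smaller since attention/embedding weights are shared.
per_expert_upper_bound = total_params / num_experts  # 1.25B

print(f"active fraction: {active_fraction:.2%}")
print(f"per-expert upper bound: {per_expert_upper_bound / 1e9:.2f}B params")
```

This sparsity is what makes an 80B-parameter model tractable at inference time: each token only pays the compute cost of the ~13B active parameters, which is consistent with the multi-GPU VRAM requirement listed under Quick Start.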

Quick Start & Requirements

  • Requirements: Linux, NVIDIA GPU (CUDA 12.8), Python 3.11+, PyTorch 2.7.1. Requires 170GB disk space and ≥3x80GB VRAM (4x80GB recommended).
  • Installation: Install PyTorch (cu128), tencentcloud-sdk-python, and the dependencies in requirements.txt. Optional optimizations: FlashAttention and FlashInfer, for up to 3x faster inference.
  • Usage: Download weights from HuggingFace (hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3). Run via Transformers library or run_image_gen.py. Interactive Gradio demo available.
  • Links: Official website (implied), HuggingFace, GitHub.
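The steps above can be assembled into a quick-start sketch. The download command and version numbers come from the summary; the exact pip package names for the optional optimizations and the run_image_gen.py flags are assumptions, so check the repository README before running.

```shell
# Quick-start sketch, assuming a Linux host with an NVIDIA GPU (CUDA 12.8),
# Python 3.11+, ~170GB free disk, and >=3x80GB of VRAM.

# 1. Install PyTorch 2.7.1 built against CUDA 12.8.
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128

# 2. Install the SDK and the project's dependencies.
pip install tencentcloud-sdk-python
pip install -r requirements.txt

# 3. Optional: FlashAttention / FlashInfer for up to ~3x faster inference
#    (package names are assumptions; see the README for the supported install).
pip install flash-attn
pip install flashinfer

# 4. Download the weights to a local directory. The dot in the repo name
#    prevents loading it directly by hub name with Transformers, so a
#    dot-free local directory is used.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3

# 5. Generate an image (flags are illustrative; run the script with
#    --help for the actual interface, or use the Gradio demo).
python run_image_gen.py --prompt "a watercolor fox in a misty forest"
```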

Highlighted Details

  • Features the largest open-source image generation MoE model (80B total parameters, 64 experts).
  • Unified autoregressive architecture for integrated multimodal understanding and generation.
  • Demonstrates superior image generation performance with exceptional prompt adherence and photorealism.
  • Incorporates intelligent world-knowledge reasoning for automatic prompt elaboration.

Maintenance & Community

The project welcomes community contributions and mentions WeChat and Discord channels, though direct links are not provided in the README. Key components like inference code and checkpoints are open-sourced, with plans for Instruct Checkpoints, VLLM support, and Image-to-Image generation.

Licensing & Compatibility

The specific open-source license is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The base pre-trained checkpoint requires external prompt enhancement (e.g., via DeepSeek). Because the repo name tencent/HunyuanImage-3.0 contains a dot, the weights must be downloaded to a local directory (renamed without the dot) before loading with Transformers. Instruct Checkpoints, VLLM support, and Image-to-Image generation are not yet open-sourced.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
12
Issues (30d)
36
Star History
2,215 stars in the last 19 days

