zai-org/GLM-Image: High-fidelity image generation with superior text rendering
Top 49.7% on SourcePulse
Summary
GLM-Image is an auto-regressive image generation model focused on high-fidelity, knowledge-intensive synthesis. It addresses challenges in precise semantic understanding and the expression of complex information, offering advantages over mainstream latent diffusion models, particularly in text-rendering scenarios. It suits users who need advanced text-to-image and image-to-image capabilities.
How It Works
It employs a hybrid autoregressive + diffusion decoder architecture: an autoregressive generator (based on GLM-4-9B) produces initial visual tokens, and a 7B-parameter DiT-based diffusion decoder then decodes them in latent space. A Glyph Encoder improves text-rendering accuracy. Post-training refinement uses decoupled reinforcement learning (GRPO), with the autoregressive module optimized for aesthetics and semantics and the decoder for detail and text accuracy.
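The sketch below illustrates this two-stage division of labor only; the class and method names are hypothetical stand-ins, not the actual GLM-Image API.

```python
# Hypothetical sketch of the hybrid AR + diffusion flow; these classes
# do not exist in the GLM-Image codebase and only illustrate the
# division of labor between the two stages.
import torch

class AutoregressiveGenerator:
    """Stand-in for the GLM-4-9B-based stage: prompt -> discrete visual tokens."""
    def generate(self, prompt: str, num_tokens: int = 256) -> torch.Tensor:
        # The real model predicts visual tokens step by step from the prompt.
        return torch.randint(0, 16384, (1, num_tokens))

class DiffusionDecoder:
    """Stand-in for the 7B DiT decoder: visual tokens -> denoised latent."""
    def decode(self, tokens: torch.Tensor, steps: int = 30) -> torch.Tensor:
        # The real decoder iteratively denoises a latent conditioned on the
        # AR tokens; a VAE then maps the latent to pixels.
        latent = torch.randn(1, 4, 64, 64)
        for _ in range(steps):
            latent = latent * 0.99  # placeholder for one denoising step
        return latent

tokens = AutoregressiveGenerator().generate("A sign reading 'OPEN 24 HOURS'")
latent = DiffusionDecoder().decode(tokens)  # detail and text fidelity live here
```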
Quick Start & Requirements
Install the latest transformers and diffusers from source via pip:
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
Inference requires CUDA, torch_dtype=torch.bfloat16, and device_map="cuda". Significant VRAM is needed: more than 80GB on a single GPU, or a multi-GPU setup. Model weights are available on 🤗 Hugging Face and 🤖 ModelScope.
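A minimal text-to-image sketch using the flags quoted above with diffusers' generic DiffusionPipeline loader; the checkpoint id "zai-org/GLM-Image", the prompt, and the height/width arguments are assumptions, so check the model card for exact usage.

```python
import torch
from diffusers import DiffusionPipeline

# The checkpoint id below is an assumption based on the org name; the
# dtype and device_map flags are the ones this README requires.
pipe = DiffusionPipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe(
    prompt="A chalkboard menu listing 'Latte 4.50' in neat handwriting",
    height=1024,  # target resolutions must be divisible by 32
    width=1024,
).images[0]
image.save("glm_image_sample.png")
```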
Highlighted Details
Maintenance & Community
Community channels include WeChat and Discord. Technical details are available in the GLM-Image Technical Blog and Model Card.
Licensing & Compatibility
The README does not specify the license type or compatibility for commercial use.
Limitations & Caveats
High runtime cost and substantial hardware demands (>80GB VRAM on a single GPU, or a multi-GPU setup). Target image resolutions must be divisible by 32 (a size-snapping helper is sketched below). SGLang integration to speed up the autoregressive stage is in progress.
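Because outputs must land on 32-pixel boundaries, a small helper like the following (hypothetical, not part of the repo) can snap an arbitrary target size to a valid one.

```python
# Hypothetical helper, not part of the GLM-Image repo: snap an arbitrary
# target size to the nearest resolution divisible by 32.
def snap_to_multiple_of_32(height: int, width: int) -> tuple[int, int]:
    def snap(x: int) -> int:
        return max(32, round(x / 32) * 32)
    return snap(height), snap(width)

print(snap_to_multiple_of_32(1000, 750))  # -> (992, 736)
```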