GLM-Image  by zai-org

High-fidelity image generation with superior text rendering

Created 2 weeks ago

686 stars

Top 49.7% on SourcePulse

Project Summary

Summary

GLM-Image is an autoregressive image generation model focused on high-fidelity, knowledge-intensive synthesis. It targets prompts that demand precise semantic understanding and the expression of complex information, where it offers advantages over mainstream latent diffusion models, particularly in text rendering. It suits users who need advanced text-to-image and image-to-image capabilities.

How It Works

GLM-Image uses a hybrid architecture that pairs an autoregressive generator with a diffusion decoder. The autoregressive generator (based on GLM-4-9B) first produces visual tokens; a 7B-parameter, DiT-based diffusion decoder then decodes them in latent space. A Glyph Encoder improves text-rendering accuracy. Both stages are refined with decoupled reinforcement learning (GRPO): the autoregressive module is optimized for aesthetics and semantics, the decoder for fine detail and text accuracy.

Quick Start & Requirements

Install transformers and diffusers from source via pip:

pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git

Inference requires a CUDA GPU; load the model with torch_dtype=torch.bfloat16 and device_map="cuda". VRAM requirements are substantial: more than 80GB on a single GPU, or a multi-GPU setup. Model download links are on 🤗 Hugging Face and 🤖 ModelScope.
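A minimal loading sketch under the settings above. It assumes the model is published under the repo id "zai-org/GLM-Image" and that the generic `DiffusionPipeline.from_pretrained` entry point in a source install of diffusers resolves the right pipeline class; neither is confirmed by this summary, and `load_glm_image` is a hypothetical helper name, so check the model card before relying on it.

```python
# Assumed Hugging Face repo id; verify against the model card.
MODEL_ID = "zai-org/GLM-Image"


def load_glm_image(model_id: str = MODEL_ID):
    """Load the pipeline in bfloat16 on a CUDA device (hypothetical helper)."""
    # Imported lazily so this module can be imported on machines
    # that do not have torch or diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    if not torch.cuda.is_available():
        raise RuntimeError("GLM-Image inference requires a CUDA device")

    # dtype and device_map follow the requirements stated above.
    return DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="cuda",
    )
```

Calling `load_glm_image()` downloads the weights on first use; the generation arguments themselves are documented on the model card.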

Highlighted Details

  • Excels at text rendering and knowledge-intensive generation, reportedly outperforming other models on benchmarks such as LongText-Bench.
  • Supports diverse image-to-image tasks: editing, style transfer, identity preservation, and multi-subject consistency.
  • Features a novel hybrid autoregressive + diffusion decoder architecture.
  • Utilizes GRPO for reinforcement learning to improve semantic understanding and visual detail.

Maintenance & Community

Community channels include WeChat and Discord. Technical details are available on the GLM-Image Technical Blog and Model Card.

Licensing & Compatibility

The README does not specify the license type or compatibility for commercial use.

Limitations & Caveats

Runtime cost is high and hardware demands are substantial (more than 80GB of VRAM on a single GPU, or a multi-GPU setup). Target image width and height must each be divisible by 32. SGLang integration to speed up the autoregressive stage is in progress.
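The divisible-by-32 constraint can be handled before calling the model. The following is a small sketch, not part of the project; `snap_to_multiple` and `snap_resolution` are hypothetical helper names that round a requested size to the nearest valid one.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round a positive size to the nearest multiple (ties round up)."""
    if value <= 0:
        raise ValueError("size must be positive")
    # Integer arithmetic avoids float rounding surprises.
    snapped = (value + multiple // 2) // multiple * multiple
    # Never return 0 for very small inputs.
    return max(multiple, snapped)


def snap_resolution(width: int, height: int) -> tuple[int, int]:
    """Return (width, height) with both sides divisible by 32."""
    return snap_to_multiple(width), snap_to_multiple(height)


print(snap_resolution(1000, 1080))  # (992, 1088)
```

Rounding to the nearest multiple changes the aspect ratio slightly; rounding both sides down instead would be the safer choice when VRAM is tight.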

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 17
  • Star history: 688 stars in the last 18 days
