zai-org/GLM-Image: High-fidelity image generation with superior text rendering
Top 49.7% on SourcePulse
Summary
GLM-Image is an auto-regressive image generation model focused on high-fidelity, knowledge-intensive synthesis. It addresses challenges in precise semantic understanding and the expression of complex information, offering advantages over mainstream latent diffusion models, particularly in text-rendering scenarios. It suits users who need advanced text-to-image and image-to-image capabilities.
How It Works
It employs a hybrid autoregressive + diffusion decoder architecture: an autoregressive generator (based on GLM-4-9B) produces initial visual tokens, and a 7B-parameter DiT-based diffusion decoder then decodes them in latent space. A Glyph Encoder improves text-rendering accuracy. Post-training refinement uses decoupled reinforcement learning (GRPO), with the autoregressive module optimized for aesthetics and semantics and the decoder for detail and text accuracy.
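The sketch below illustrates this two-stage division of labor only; the class and method names are hypothetical stand-ins, not the actual GLM-Image API.

```python
# Hypothetical sketch of the hybrid AR + diffusion flow; these classes
# do not exist in the GLM-Image codebase and only illustrate the
# division of labor between the two stages.
import torch

class AutoregressiveGenerator:
    """Stand-in for the GLM-4-9B-based stage: prompt -> discrete visual tokens."""
    def generate(self, prompt: str, num_tokens: int = 256) -> torch.Tensor:
        # The real model predicts visual tokens step by step from the prompt.
        return torch.randint(0, 16384, (1, num_tokens))

class DiffusionDecoder:
    """Stand-in for the 7B DiT decoder: visual tokens -> denoised latent."""
    def decode(self, tokens: torch.Tensor, steps: int = 30) -> torch.Tensor:
        # The real decoder iteratively denoises a latent conditioned on the
        # AR tokens; a VAE then maps the latent to pixels.
        latent = torch.randn(1, 4, 64, 64)
        for _ in range(steps):
            latent = latent * 0.99  # placeholder for one denoising step
        return latent

tokens = AutoregressiveGenerator().generate("A sign reading 'OPEN 24 HOURS'")
latent = DiffusionDecoder().decode(tokens)  # detail and text fidelity live here
```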
Quick Start & Requirements
Install the latest transformers and diffusers from source via pip:
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/huggingface/diffusers.git
Inference requires CUDA, torch_dtype=torch.bfloat16, and device_map="cuda". Significant VRAM is needed: more than 80GB on a single GPU, or a multi-GPU setup. Model weights are available on 🤗 Hugging Face and 🤖 ModelScope.
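A minimal text-to-image sketch using the flags quoted above with diffusers' generic DiffusionPipeline loader; the checkpoint id "zai-org/GLM-Image", the prompt, and the height/width arguments are assumptions, so check the model card for exact usage.

```python
import torch
from diffusers import DiffusionPipeline

# The checkpoint id below is an assumption based on the org name; the
# dtype and device_map flags are the ones this README requires.
pipe = DiffusionPipeline.from_pretrained(
    "zai-org/GLM-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe(
    prompt="A chalkboard menu listing 'Latte 4.50' in neat handwriting",
    height=1024,  # target resolutions must be divisible by 32
    width=1024,
).images[0]
image.save("glm_image_sample.png")
```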
Highlighted Details
Maintenance & Community
Community channels include WeChat and Discord. Technical details are available in the GLM-Image Technical Blog and Model Card.
Licensing & Compatibility
The README does not specify the license type or compatibility for commercial use.
Limitations & Caveats
High runtime cost and substantial hardware demands (>80GB VRAM on a single GPU, or a multi-GPU setup). Target image resolutions must be divisible by 32 (a size-snapping helper is sketched below). SGLang integration to speed up the autoregressive stage is in progress.
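Because outputs must land on 32-pixel boundaries, a small helper like the following (hypothetical, not part of the repo) can snap an arbitrary target size to a valid one.

```python
# Hypothetical helper, not part of the GLM-Image repo: snap an arbitrary
# target size to the nearest resolution divisible by 32.
def snap_to_multiple_of_32(height: int, width: int) -> tuple[int, int]:
    def snap(x: int) -> int:
        return max(32, round(x / 32) * 32)
    return snap(height), snap(width)

print(snap_to_multiple_of_32(1000, 750))  # -> (992, 736)
```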