GLIGEN: research code for grounded text-to-image generation
GLIGEN enables grounded text-to-image generation by extending frozen, pre-trained diffusion models to accept spatial grounding inputs such as bounding boxes, keypoints, and edge maps, giving precise control over image content and composition. On layout-to-image benchmarks such as COCO and LVIS, its zero-shot results outperform existing supervised baselines.
How It Works
GLIGEN adds a "grounding tokenizer" module to a standard Stable Diffusion architecture whose original weights stay frozen. The module encodes bounding boxes, keypoints, or other condition maps, together with their associated phrases, into grounding tokens, which are injected into the network through new trainable gated self-attention layers. This leverages the power of pre-trained text-to-image models while adding fine-grained spatial control, without retraining the base model's original weights.
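The core idea fits in a short sketch. The PyTorch code below is a conceptual illustration under assumed names and dimensions (GroundingTokenizer, GatedSelfAttention, fourier_embed, 768-dim tokens); it is not the repository's implementation, but it shows how phrase and box embeddings can be fused into grounding tokens and added back through a zero-initialized gate, so the frozen base model is unchanged at the start of training.

# Conceptual sketch of GLIGEN-style grounding, not the repo's actual code.
# Class names, dimensions, and the smoke test below are illustrative assumptions.
import math
import torch
import torch.nn as nn

def fourier_embed(coords: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    # Map normalized box coords (B, N, 4) to sin/cos Fourier features.
    freqs = 2.0 ** torch.arange(num_freqs, device=coords.device) * math.pi
    angles = coords.unsqueeze(-1) * freqs            # (B, N, 4, F)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return feats.flatten(start_dim=2)                # (B, N, 4 * 2F)

class GroundingTokenizer(nn.Module):
    # Fuse a phrase embedding with its box embedding into one grounding token.
    def __init__(self, text_dim: int = 768, out_dim: int = 768, num_freqs: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 4 * 2 * num_freqs, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, phrase_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # phrase_emb: (B, N, text_dim); boxes: (B, N, 4) in [0, 1] as (x0, y0, x1, y1)
        return self.mlp(torch.cat([phrase_emb, fourier_embed(boxes)], dim=-1))

class GatedSelfAttention(nn.Module):
    # New trainable layer: attends over [visual tokens, grounding tokens], then
    # adds the result back through a zero-initialized gate (identity at init).
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))      # tanh(0) = 0

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        n_vis = visual.shape[1]
        x = self.norm(torch.cat([visual, grounding], dim=1))
        attended, _ = self.attn(x, x, x)
        return visual + torch.tanh(self.gate) * attended[:, :n_vis]

# Tiny smoke test with random tensors standing in for real embeddings.
tokenizer, gsa = GroundingTokenizer(), GatedSelfAttention()
phrases, boxes = torch.randn(1, 2, 768), torch.rand(1, 2, 4)
visual = torch.randn(1, 64, 768)                      # e.g. an 8x8 latent feature map
out = gsa(visual, tokenizer(phrases, boxes))
print(out.shape)                                      # torch.Size([1, 64, 768])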
Quick Start & Requirements
Place the downloaded checkpoints in gligen_checkpoints, then run:

python gligen_inference.py
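If setting up the original repo is inconvenient, box-grounded generation can also be tried through the Hugging Face diffusers port. The snippet below is a sketch: the StableDiffusionGLIGENPipeline class and the gligen_* arguments exist in recent diffusers releases, but the checkpoint id and argument details here are assumptions to verify against the current diffusers documentation.

# Sketch of box-grounded generation via the diffusers GLIGEN pipeline.
# Assumes a CUDA GPU; checkpoint id and argument names should be double-checked.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a birthday cake on a wooden table",
    gligen_phrases=["a birthday cake"],
    gligen_boxes=[[0.25, 0.45, 0.75, 0.90]],      # normalized (x0, y0, x1, y1)
    gligen_scheduled_sampling_beta=1.0,            # fraction of steps that use grounding
    num_inference_steps=50,
).images[0]
image.save("grounded_cake.png")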
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The provided semantic map checkpoint is trained only on ADE20K, and the normal map checkpoint only on DIODE. The project aims to reproduce the paper's results, so minor implementation differences may exist. The authors emphasize responsible use and discourage generating misleading or malicious images.