GLIGEN by gligen

Text-to-image generation research paper using grounded prompts

Created 2 years ago · 2,157 stars · Top 20.9% on SourcePulse

View on GitHub
Project Summary

GLIGEN enables grounded text-to-image generation by extending frozen diffusion models with spatial prompts such as bounding boxes, keypoints, and edge maps. This allows precise control over image content and composition; its zero-shot results on COCO and LVIS outperform existing supervised layout-to-image baselines.

How It Works

GLIGEN integrates a novel "grounding tokenizer" module into standard Stable Diffusion architectures. This module learns to condition the diffusion process on spatial information, effectively translating bounding boxes, keypoints, or other conditioning maps into a format understandable by the diffusion model. This approach leverages the power of pre-trained text-to-image models while adding fine-grained spatial control without requiring extensive retraining of the base model.
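The core idea of turning a box into something the diffusion model can attend to can be sketched in a few lines. The sketch below is illustrative only: it Fourier-encodes normalized box coordinates and concatenates them with a phrase embedding, which is the shape of GLIGEN's grounding tokens, but the real module uses CLIP text embeddings and a learned MLP, and the function names here (`fourier_embed`, `grounding_token`) are hypothetical.

```python
import math

def fourier_embed(coords, num_freqs=8):
    """Map normalized box coordinates (x0, y0, x1, y1) to a Fourier
    feature vector: sin/cos pairs at geometrically spaced frequencies,
    the style of positional encoding applied to boxes before they are
    fused with phrase embeddings."""
    out = []
    for c in coords:
        for k in range(num_freqs):
            freq = 2.0 ** k
            out.append(math.sin(freq * math.pi * c))
            out.append(math.cos(freq * math.pi * c))
    return out

def grounding_token(phrase_embedding, box, num_freqs=8):
    """Illustrative fusion: concatenate the phrase embedding with the
    Fourier-encoded box. In GLIGEN proper, this concatenation passes
    through a learned MLP and is injected into the frozen model via
    new gated self-attention layers."""
    assert all(0.0 <= c <= 1.0 for c in box), "boxes are normalized to [0, 1]"
    return list(phrase_embedding) + fourier_embed(box, num_freqs)

# A dummy 4-d "phrase embedding" plus one box gives a 4 + 4*2*8 = 68-d token.
token = grounding_token([0.1, 0.2, 0.3, 0.4], (0.1, 0.2, 0.6, 0.8))
```

Because only the new gated layers are trained, the pre-trained weights stay frozen, which is why GLIGEN avoids retraining the base model.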

Quick Start & Requirements

  • Install: Dockerfile provided for environment setup.
  • Models: Download checkpoints from Hugging Face Hub for various modalities (Box+Text, Keypoint, HED map, etc.).
  • Inference: Run python gligen_inference.py after placing models in gligen_checkpoints.
  • Requirements: PyTorch, Stable Diffusion base models. GPU recommended.
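For the Box+Text modality, inference boils down to pairing each grounded phrase with a normalized bounding box. The helper below is a hypothetical sketch of that input shape, not the script's actual API; the key names ("prompt", "phrases", "locations") are assumptions, so check the meta configuration inside gligen_inference.py before running.

```python
def make_grounded_prompt(caption, phrases, boxes):
    """Bundle a caption with per-phrase bounding boxes.

    Hypothetical helper illustrating Box+Text inputs; the exact keys
    and format are defined by gligen_inference.py itself."""
    if len(phrases) != len(boxes):
        raise ValueError("one box per grounded phrase")
    for box in boxes:
        x0, y0, x1, y1 = box
        # Boxes are assumed normalized to [0, 1] with x0 < x1 and y0 < y1.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"bad box: {box}")
    return {
        "prompt": caption,
        "phrases": list(phrases),
        "locations": [list(b) for b in boxes],
    }

spec = make_grounded_prompt(
    "a teddy bear next to a red ball on grass",
    ["a teddy bear", "a red ball"],
    [(0.05, 0.3, 0.45, 0.95), (0.55, 0.5, 0.85, 0.9)],
)
```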

Highlighted Details

  • Supports open-set grounded generation and inpainting.
  • Integrates with Grounding DINO for automatic bounding box localization.
  • Achieves zero-shot performance exceeding supervised layout-to-image baselines.
  • Offers checkpoints for various conditioning modalities including HED, Canny, Depth, and Semantic maps.

Maintenance & Community

  • Code and checkpoints released March 2023.
  • Paper accepted to CVPR 2023.
  • Integrated into LLaVA-Interactive demo.
  • Project page, paper, and demo links available.

Licensing & Compatibility

  • Use is restricted by the licenses of the underlying Latent Diffusion and Stable Diffusion models.
  • Primarily for research purposes.

Limitations & Caveats

The provided semantic map checkpoint is trained only on ADE20K, and the normal map checkpoint only on DIODE. The project aims to reproduce paper results, and minor implementation differences may exist. Responsible AI use is emphasized, with discouragement of misuse for misleading or malicious image generation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days

Starred by Max Howell (author of Homebrew), Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), and 1 more.

Explore Similar Projects

big-sleep by lucidrains: CLI tool for text-to-image generation. 3k stars; created 4 years ago, updated 3 years ago.