GLIGEN by gligen

Text-to-image generation research paper using grounded prompts

created 2 years ago
2,139 stars

Top 21.5% on sourcepulse

View on GitHub
Project Summary

GLIGEN enables grounded text-to-image generation by extending frozen diffusion models with spatial grounding inputs such as bounding boxes, keypoints, and edge maps. This allows precise control over image content and composition; its zero-shot results on COCO and LVIS outperform existing supervised layout-to-image baselines.

How It Works

GLIGEN freezes the weights of a pre-trained Stable Diffusion model and inserts new trainable layers. A grounding encoder first converts spatial inputs (bounding boxes paired with text phrases, keypoints, or conditioning maps) into grounding tokens; gated self-attention layers added to the UNet then let the visual tokens attend to these grounding tokens. This leverages the power of the pre-trained text-to-image model while adding fine-grained spatial control, without retraining the base model.
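
The core mechanism can be sketched in a few lines of PyTorch. This is a minimal illustration only: the class name, argument names, and zero-initialized gate follow the paper's description, not the repository's actual code.

    # Minimal sketch of GLIGEN-style gated self-attention (illustrative,
    # not the repository's API). Visual tokens attend over themselves plus
    # the grounding tokens; a learned gate scales the new contribution.
    import torch
    import torch.nn as nn

    class GatedSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            # Gate initialized to zero, so tanh(gamma) = 0 and the frozen
            # model's activations pass through unchanged at the start.
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, visual_tokens, grounding_tokens):
            # Attend over [visual; grounding] tokens, but keep only the
            # outputs at the visual-token positions.
            n = visual_tokens.shape[1]
            x = self.norm(torch.cat([visual_tokens, grounding_tokens], dim=1))
            out, _ = self.attn(x, x, x)
            return visual_tokens + torch.tanh(self.gamma) * out[:, :n]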

Quick Start & Requirements

  • Install: Dockerfile provided for environment setup.
  • Models: Download checkpoints from Hugging Face Hub for various modalities (Box+Text, Keypoint, HED map, etc.).
  • Inference: Run python gligen_inference.py after placing the downloaded models in the gligen_checkpoints folder (see the prompt sketch after this list).
  • Requirements: PyTorch, Stable Diffusion base models. GPU recommended.
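
Grounded prompts are specified as phrase/box pairs. The sketch below is modeled on the meta dictionaries inside gligen_inference.py; the exact keys and the checkpoint filename are assumptions to verify against the script you download.

    # Hedged sketch of a box-grounded prompt for gligen_inference.py.
    # Keys and checkpoint path are assumptions; check the script's own
    # meta list for the release you downloaded.
    meta = dict(
        ckpt="gligen_checkpoints/checkpoint_generation_text.pth",  # assumed filename
        prompt="a teddy bear sitting next to a red bird",
        phrases=["a teddy bear", "a red bird"],
        # One box per phrase, normalized [x0, y0, x1, y1] in [0, 1].
        locations=[[0.00, 0.10, 0.50, 0.90], [0.55, 0.25, 1.00, 0.70]],
    )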

Highlighted Details

  • Supports open-set grounded generation and inpainting.
  • Integrates with Grounding DINO for automatic bounding-box localization (see the glue sketch after this list).
  • Achieves zero-shot performance exceeding supervised layout-to-image baselines.
  • Offers checkpoints for various conditioning modalities including HED, Canny, Depth, and Semantic maps.
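
The Grounding DINO integration boils down to converting detected boxes into the normalized phrase/location pairs GLIGEN expects. A hypothetical glue function, assuming the detector returns pixel-space xyxy boxes:

    # Hypothetical helper (not part of the GLIGEN repo): normalize
    # detector boxes from pixel xyxy coordinates to the [0, 1] range
    # used by GLIGEN's location inputs.
    def boxes_to_locations(boxes_xyxy, width, height):
        return [
            [x0 / width, y0 / height, x1 / width, y1 / height]
            for (x0, y0, x1, y1) in boxes_xyxy
        ]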

Maintenance & Community

  • Code and checkpoints released March 2023.
  • Paper accepted to CVPR 2023.
  • Integrated into LLaVA-Interactive demo.
  • Project page, paper, and demo links available.

Licensing & Compatibility

  • Use is governed by the terms and licenses of the underlying Latent Diffusion Model and Stable Diffusion releases.
  • Primarily intended for research purposes.

Limitations & Caveats

The provided semantic map checkpoint is trained only on ADE20K, and the normal map checkpoint only on DIODE. The released code aims to reproduce the paper's results, so minor implementation differences may exist. The authors emphasize responsible use and discourage generating misleading or malicious images.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 36 stars in the last 90 days
