GLIGEN by gligen

Text-to-image generation research paper using grounded prompts

Created 2 years ago · 2,157 stars · Top 20.9% on SourcePulse

View on GitHub
Project Summary

GLIGEN enables grounded text-to-image generation by extending frozen diffusion models with spatial prompts such as bounding boxes, keypoints, and edge maps. This allows precise control over image content and composition; its zero-shot results on COCO and LVIS outperform existing supervised layout-to-image baselines.

How It Works

GLIGEN integrates a novel "grounding tokenizer" module into standard Stable Diffusion architectures. This module learns to condition the diffusion process on spatial information, effectively translating bounding boxes, keypoints, or other conditioning maps into a format understandable by the diffusion model. This approach leverages the power of pre-trained text-to-image models while adding fine-grained spatial control without requiring extensive retraining of the base model.
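The core idea of turning a box into something the diffusion model can attend to can be sketched in a few lines. The sketch below is illustrative only: it Fourier-encodes normalized box coordinates and concatenates them with a phrase embedding, which is the shape of GLIGEN's grounding tokens, but the real module uses CLIP text embeddings and a learned MLP, and the function names here (`fourier_embed`, `grounding_token`) are hypothetical.

```python
import math

def fourier_embed(coords, num_freqs=8):
    """Map normalized box coordinates (x0, y0, x1, y1) to a Fourier
    feature vector: sin/cos pairs at geometrically spaced frequencies,
    the style of positional encoding applied to boxes before they are
    fused with phrase embeddings."""
    out = []
    for c in coords:
        for k in range(num_freqs):
            freq = 2.0 ** k
            out.append(math.sin(freq * math.pi * c))
            out.append(math.cos(freq * math.pi * c))
    return out

def grounding_token(phrase_embedding, box, num_freqs=8):
    """Illustrative fusion: concatenate the phrase embedding with the
    Fourier-encoded box. In GLIGEN proper, this concatenation passes
    through a learned MLP and is injected into the frozen model via
    new gated self-attention layers."""
    assert all(0.0 <= c <= 1.0 for c in box), "boxes are normalized to [0, 1]"
    return list(phrase_embedding) + fourier_embed(box, num_freqs)

# A dummy 4-d "phrase embedding" plus one box gives a 4 + 4*2*8 = 68-d token.
token = grounding_token([0.1, 0.2, 0.3, 0.4], (0.1, 0.2, 0.6, 0.8))
```

Because only the new gated layers are trained, the pre-trained weights stay frozen, which is why GLIGEN avoids retraining the base model.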

Quick Start & Requirements

  • Install: Dockerfile provided for environment setup.
  • Models: Download checkpoints from Hugging Face Hub for various modalities (Box+Text, Keypoint, HED map, etc.).
  • Inference: Run python gligen_inference.py after placing models in gligen_checkpoints.
  • Requirements: PyTorch, Stable Diffusion base models. GPU recommended.
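For the Box+Text modality, inference boils down to pairing each grounded phrase with a normalized bounding box. The helper below is a hypothetical sketch of that input shape, not the script's actual API; the key names ("prompt", "phrases", "locations") are assumptions, so check the meta configuration inside gligen_inference.py before running.

```python
def make_grounded_prompt(caption, phrases, boxes):
    """Bundle a caption with per-phrase bounding boxes.

    Hypothetical helper illustrating Box+Text inputs; the exact keys
    and format are defined by gligen_inference.py itself."""
    if len(phrases) != len(boxes):
        raise ValueError("one box per grounded phrase")
    for box in boxes:
        x0, y0, x1, y1 = box
        # Boxes are assumed normalized to [0, 1] with x0 < x1 and y0 < y1.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"bad box: {box}")
    return {
        "prompt": caption,
        "phrases": list(phrases),
        "locations": [list(b) for b in boxes],
    }

spec = make_grounded_prompt(
    "a teddy bear next to a red ball on grass",
    ["a teddy bear", "a red ball"],
    [(0.05, 0.3, 0.45, 0.95), (0.55, 0.5, 0.85, 0.9)],
)
```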

Highlighted Details

  • Supports open-set grounded generation and inpainting.
  • Integrates with Grounding DINO for automatic bounding box localization.
  • Achieves zero-shot performance exceeding supervised layout-to-image baselines.
  • Offers checkpoints for various conditioning modalities including HED, Canny, Depth, and Semantic maps.

Maintenance & Community

  • Code and checkpoints released March 2023.
  • Paper accepted to CVPR 2023.
  • Integrated into LLaVA-Interactive demo.
  • Project page, paper, and demo links available.

Licensing & Compatibility

  • Use is restricted by the licenses of the underlying Latent Diffusion and Stable Diffusion models.
  • Primarily for research purposes.

Limitations & Caveats

The provided semantic map checkpoint is trained only on ADE20K, and the normal map checkpoint only on DIODE. The project aims to reproduce paper results, and minor implementation differences may exist. Responsible AI use is emphasized, with discouragement of misuse for misleading or malicious image generation.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days

Starred by Max Howell (author of Homebrew), Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), and 1 more.

Explore Similar Projects

big-sleep by lucidrains: CLI tool for text-to-image generation. 3k stars; created 4 years ago, updated 3 years ago.