LLM-groundedDiffusion by TonyLianLong

Research code enhancing text-to-image diffusion models using LLMs

created 2 years ago
476 stars

Top 65.0% on sourcepulse

Project Summary

This repository provides LLM-grounded Diffusion (LMD), a framework that enhances text-to-image diffusion models by leveraging Large Language Models (LLMs) for improved prompt understanding and control. It targets researchers and developers working with generative AI, offering a method to generate images that more accurately reflect complex textual descriptions, including spatial relationships and object attributes.

How It Works

LMD operates in two stages. First, an LLM parses the input text prompt into an intermediate layout representation, typically captioned bounding boxes plus a background prompt and a negative prompt. Second, that LLM-generated layout conditions a layout-to-image diffusion method (such as GLIGEN adapters, attention control, or region control) to produce the final image. This gives fine-grained control over image composition without extensive model retraining.
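
As a rough illustration of stage 1, the sketch below asks an LLM for a layout and parses the reply. The instruction wording, the JSON response format, and the request_layout helper are simplified assumptions for illustration, not the repository's actual prompt templates or parser.

```python
# Stage 1 sketch: ask an LLM to turn a caption into a layout.
# The instruction text and JSON schema are illustrative assumptions;
# LMD's real prompts and parsing differ.
import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def request_layout(caption: str) -> dict:
    """Return captioned boxes, a background prompt, and a negative prompt."""
    instruction = (
        "You are a layout planner. Given an image caption, reply with JSON "
        "containing 'boxes' (a list of [phrase, [x, y, w, h]] on a 512x512 "
        "canvas), 'background_prompt', and 'negative_prompt'.\n"
        f"Caption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
        temperature=0.25,
    )
    return json.loads(response.choices[0].message.content)

layout = request_layout("a gray cat sitting to the left of a red ball on grass")
print(layout["boxes"], layout["background_prompt"], layout["negative_prompt"])
```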

Quick Start & Requirements

  • Install via pip: pip install -r requirements.txt
  • Requires Python 3.x.
  • Supports SD v1, SD v2, and SDXL with refiner.
  • Can utilize OpenAI API (GPT-3.5/4) or self-hosted open-source LLMs (e.g., Mixtral, Llama 2) via FastChat; a minimal client sketch follows this list.
  • Official HuggingFace demo available.
  • Project page and detailed documentation are linked in the README.
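
For the self-hosted route, FastChat can expose an OpenAI-compatible endpoint, so the same client code from stage 1 can be pointed at a local model. The model name, port, and prompt below are placeholders; this is a minimal sketch rather than the repository's launch scripts.

```python
# Sketch: reuse the OpenAI client against a FastChat-served open LLM.
# Assumes a FastChat OpenAI-compatible API server is already running locally
# (fastchat.serve.openai_api_server); model name and port are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mixtral-8x7B-Instruct-v0.1",  # whichever model FastChat is serving
    messages=[{"role": "user", "content": "Plan a layout for: a dog next to a tree"}],
)
print(response.choices[0].message.content)
```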

Highlighted Details

  • Integrates with upstream diffusers library (v0.24.0+).
  • Supports training-free LMD and LMD+ (with GLIGEN adapters).
  • Implements multiple stage 2 layout-to-image methods: GLIGEN, BoxDiff, MultiDiffusion, Backward Guidance (a generic GLIGEN example is sketched after this list).
  • Offers unified benchmarking for evaluating both text-to-layout and layout-to-image stages.
  • Supports FlashAttention and PyTorch v2.
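
To give a concrete sense of the GLIGEN-based stage 2, the sketch below renders a layout with diffusers' stock StableDiffusionGLIGENPipeline. This is a generic grounded-generation example with assumed phrases and box coordinates, not the repository's LMD+ pipeline or its attention/region-control variants.

```python
# Sketch of a layout-to-image step with GLIGEN via diffusers.
# Uses the stock StableDiffusionGLIGENPipeline, not LMD+'s own pipeline;
# the phrases and box coordinates are illustrative.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a gray cat and a red ball on grass",
    gligen_phrases=["a gray cat", "a red ball"],
    gligen_boxes=[[0.05, 0.4, 0.45, 0.9], [0.55, 0.6, 0.85, 0.9]],  # normalized xyxy
    gligen_scheduled_sampling_beta=0.4,
    num_inference_steps=50,
).images[0]
image.save("layout_grounded.png")
```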

Maintenance & Community

  • Developed by researchers from UC Berkeley/UCSF.
  • Recent updates include SDXL support and integration with diffusers.
  • Open-source LLM integration (Mixtral, StableBeluga2) highlighted for self-hosting.
  • Contact information for the primary author is provided.

Licensing & Compatibility

  • Original code (not derived from other repositories) is MIT licensed, with an additional note regarding deposit into the BAIR Open Research Commons.
  • Uses code from diffusers, GLIGEN, and layout-guidance, requiring adherence to their respective licenses.
  • The MIT-licensed portions are compatible with commercial use, but code incorporated from other repositories may carry different terms.

Limitations & Caveats

The performance of self-hosted open-source LLMs, while comparable to GPT-3.5, is noted to be lower than that of GPT-4 for layout generation; LLM fine-tuning is suggested as a future improvement. The diffusers integration is a simplified version of LMD+.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
9 stars in the last 90 days
