LLM-groundedDiffusion by TonyLianLong

Research code enhancing text-to-image diffusion models using LLMs

created 2 years ago
476 stars

Top 65.0% on sourcepulse

Project Summary

This repository provides LLM-grounded Diffusion (LMD), a framework that enhances text-to-image diffusion models by leveraging Large Language Models (LLMs) for improved prompt understanding and control. It targets researchers and developers working with generative AI, offering a method to generate images that more accurately reflect complex textual descriptions, including spatial relationships and object attributes.

How It Works

LMD operates in two stages. First, an LLM parses the input text prompt into an intermediate layout representation, typically captioned bounding boxes plus a background prompt and a negative prompt. Second, that LLM-generated layout conditions a layout-to-image diffusion method (such as GLIGEN adapters, attention control, or region control) to produce the final image. This gives fine-grained control over image composition without extensive model retraining.
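
As a rough illustration of stage 1, the sketch below asks an LLM for a layout and parses the reply. The instruction wording, the JSON response format, and the request_layout helper are simplified assumptions for illustration, not the repository's actual prompt templates or parser.

```python
# Stage 1 sketch: ask an LLM to turn a caption into a layout.
# The instruction text and JSON schema are illustrative assumptions;
# LMD's real prompts and parsing differ.
import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def request_layout(caption: str) -> dict:
    """Return captioned boxes, a background prompt, and a negative prompt."""
    instruction = (
        "You are a layout planner. Given an image caption, reply with JSON "
        "containing 'boxes' (a list of [phrase, [x, y, w, h]] on a 512x512 "
        "canvas), 'background_prompt', and 'negative_prompt'.\n"
        f"Caption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
        temperature=0.25,
    )
    return json.loads(response.choices[0].message.content)

layout = request_layout("a gray cat sitting to the left of a red ball on grass")
print(layout["boxes"], layout["background_prompt"], layout["negative_prompt"])
```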

Quick Start & Requirements

  • Install via pip: pip install -r requirements.txt
  • Requires Python 3.x.
  • Supports SD v1, SD v2, and SDXL with refiner.
  • Can utilize OpenAI API (GPT-3.5/4) or self-hosted open-source LLMs (e.g., Mixtral, Llama 2) via FastChat; a minimal client sketch follows this list.
  • Official HuggingFace demo available.
  • Project page and detailed documentation are linked in the README.
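
For the self-hosted route, FastChat can expose an OpenAI-compatible endpoint, so the same client code from stage 1 can be pointed at a local model. The model name, port, and prompt below are placeholders; this is a minimal sketch rather than the repository's launch scripts.

```python
# Sketch: reuse the OpenAI client against a FastChat-served open LLM.
# Assumes a FastChat OpenAI-compatible API server is already running locally
# (fastchat.serve.openai_api_server); model name and port are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mixtral-8x7B-Instruct-v0.1",  # whichever model FastChat is serving
    messages=[{"role": "user", "content": "Plan a layout for: a dog next to a tree"}],
)
print(response.choices[0].message.content)
```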

Highlighted Details

  • Integrates with upstream diffusers library (v0.24.0+).
  • Supports training-free LMD and LMD+ (with GLIGEN adapters).
  • Implements multiple stage 2 layout-to-image methods: GLIGEN, BoxDiff, MultiDiffusion, Backward Guidance (a generic GLIGEN example is sketched after this list).
  • Offers unified benchmarking for evaluating both text-to-layout and layout-to-image stages.
  • Supports FlashAttention and PyTorch v2.
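
To give a concrete sense of the GLIGEN-based stage 2, the sketch below renders a layout with diffusers' stock StableDiffusionGLIGENPipeline. This is a generic grounded-generation example with assumed phrases and box coordinates, not the repository's LMD+ pipeline or its attention/region-control variants.

```python
# Sketch of a layout-to-image step with GLIGEN via diffusers.
# Uses the stock StableDiffusionGLIGENPipeline, not LMD+'s own pipeline;
# the phrases and box coordinates are illustrative.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a gray cat and a red ball on grass",
    gligen_phrases=["a gray cat", "a red ball"],
    gligen_boxes=[[0.05, 0.4, 0.45, 0.9], [0.55, 0.6, 0.85, 0.9]],  # normalized xyxy
    gligen_scheduled_sampling_beta=0.4,
    num_inference_steps=50,
).images[0]
image.save("layout_grounded.png")
```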

Maintenance & Community

  • Developed by researchers from UC Berkeley/UCSF.
  • Recent updates include SDXL support and integration with diffusers.
  • Open-source LLM integration (Mixtral, StableBeluga2) highlighted for self-hosting.
  • Contact information for the primary author is provided.

Licensing & Compatibility

  • Original code (not derived from other repositories) is MIT licensed, with an additional note regarding deposit into the BAIR Open Research Commons.
  • Uses code from diffusers, GLIGEN, and layout-guidance, requiring adherence to their respective licenses.
  • The MIT-licensed portions are compatible with commercial use, but code incorporated from other repositories may carry different terms.

Limitations & Caveats

The performance of self-hosted open-source LLMs, while comparable to GPT-3.5, is noted to be lower than that of GPT-4 for layout generation; LLM fine-tuning is suggested as a future improvement. The diffusers integration is a simplified version of LMD+.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
9 stars in the last 90 days
