rich-text-to-image  by songweige

Text-to-image research paper for enhanced generation control

Created 2 years ago
802 stars

Project Summary

This project enables fine-grained control over text-to-image generation by leveraging rich text formatting (font size, color, style, footnotes) to guide diffusion models. It targets researchers and power users seeking to precisely dictate specific attributes of generated images, offering enhanced control beyond standard text prompts.

How It Works

The method first extracts spatial-text associations from a base diffusion model's cross-attention maps. Rich text prompts, encoded into JSON, provide formatting attributes for specific text spans. A novel region-based diffusion process then uses these attributes to render distinct regions with precise control over color, style, and token importance (via font size), resulting in globally coherent images.
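To make the "rich text encoded into JSON" step concrete, here is a minimal sketch of how a formatted prompt might be split into a plain prompt plus a list of formatted spans. It assumes the JSON follows the Quill Delta layout (an `ops` list of `insert`/`attributes` entries), since the project is built on the Quill editor. The `Span` class, the `parse_rich_text` helper, and the example attribute values are hypothetical and are not taken from the repository.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Span:
    """A run of prompt text plus the formatting attributes attached to it."""
    text: str
    attributes: dict = field(default_factory=dict)


def parse_rich_text(delta_json: str):
    """Split a Quill-Delta-style JSON string into a plain prompt and formatted spans.

    Hypothetical helper: the repository's own parser may differ.
    """
    ops = json.loads(delta_json)["ops"]
    # Keep only text inserts; Delta also allows non-text embeds, which we skip.
    spans = [
        Span(op["insert"], op.get("attributes", {}))
        for op in ops
        if isinstance(op.get("insert"), str)
    ]
    plain_prompt = "".join(s.text for s in spans).strip()
    formatted = [s for s in spans if s.attributes]
    return plain_prompt, formatted


# Example prompt: one span carries a color, another a font and a size.
example = json.dumps({"ops": [
    {"insert": "a "},
    {"insert": "church", "attributes": {"color": "#ff5733"}},
    {"insert": " surrounded by a "},
    {"insert": "garden", "attributes": {"font": "monet-style", "size": "large"}},
    {"insert": "\n"},
]})

prompt, formatted = parse_rich_text(example)
print(prompt)
for span in formatted:
    print(span.text, span.attributes)
```

In this sketch the plain prompt would drive the base diffusion pass, while each formatted span would become a candidate region whose attributes steer the region-based pass.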

Quick Start & Requirements

  • Install by cloning the repository, running conda env create -f environment.yaml, and then running pip install git+https://github.com/openai/CLIP.git.
  • Requires Python 3.8 and PyTorch 1.11; supports Stable Diffusion v1-5, SDXL, and ANIMAGINE-XL.
  • Official demo available on HuggingFace Space. An A1111 WebUI extension is also available.

Highlighted Details

  • Supports LoRA checkpoints and SDXL models.
  • Enables precise color rendering using hex codes.
  • Allows local style control via font attributes (e.g., "style of Claude Monet").
  • Maps font size to token reweighting, so larger text gives a word more emphasis.
  • Footnotes can provide supplementary descriptions for specific regions.
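Two of the attributes above lend themselves to a small worked example: turning a hex color code into an RGB target, and turning font sizes into per-token weights. The sketch below is illustrative only. The `SIZE_TO_WEIGHT` table and both function names are hypothetical, and scaling each token's exponentiated attention score by its weight is one plausible form of token reweighting, not necessarily the exact formula the project uses.

```python
import math


def hex_to_rgb(hex_code: str):
    """Convert '#rrggbb' to an (r, g, b) tuple of floats in [0, 1]."""
    value = hex_code.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex code, got {hex_code!r}")
    return tuple(int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))


def reweighted_softmax(scores, weights):
    """Softmax over tokens where each exponentiated score is scaled by its weight."""
    peak = max(scores)  # subtract the max for numerical stability
    scaled = [w * math.exp(s - peak) for s, w in zip(scores, weights)]
    total = sum(scaled)
    return [x / total for x in scaled]


# Hypothetical font-size-to-weight table; the real mapping is an assumption here.
SIZE_TO_WEIGHT = {"small": 0.5, "normal": 1.0, "large": 2.0}

tokens = ["a", "pizza", "with", "mushrooms"]
sizes = ["normal", "normal", "normal", "large"]
weights = [SIZE_TO_WEIGHT[s] for s in sizes]

# Equal raw scores isolate the effect of the weights.
probs = reweighted_softmax([0.0, 0.0, 0.0, 0.0], weights)
print(hex_to_rgb("#ff0000"))
print(dict(zip(tokens, probs)))
```

With equal raw scores, the token set in the larger font receives twice the attention share of each other token, which is the kind of emphasis the font-size attribute is meant to express.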

Maintenance & Community

  • Implemented by Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang.
  • Built upon HuggingFace diffusers and Quill rich-text editor.
  • Paper accepted by ICCV 2023.

Licensing & Compatibility

  • The repository does not explicitly state a license. The underlying diffusers library is released under Apache 2.0, but that does not determine the license of this project.

Limitations & Caveats

  • With no stated license, commercial use and integration into closed-source projects carry legal uncertainty.
  • Setup pins an older PyTorch version (1.11), which may conflict with newer environments.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
