Attend-and-Excite by yuval-alaluf

Research paper implementation for text-to-image diffusion models

created 2 years ago
743 stars

Top 47.7% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation of "Attend-and-Excite," a method for improving the semantic faithfulness of text-to-image diffusion models. It addresses failure modes such as "catastrophic neglect," where the model omits subjects specified in the prompt, as well as incorrect attribute binding. The target audience is researchers and practitioners working with diffusion models who need generated images to follow complex text prompts more accurately.

How It Works

Attend-and-Excite employs a Generative Semantic Nursing (GSN) technique with an attention-based formulation. It intervenes on the fly during inference: at each denoising step it examines the cross-attention maps between the image and the prompt tokens and refines the noised latents so that every subject token is attended to and its activations are strengthened ("excited"), guiding the model to actually generate each described element.
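As a rough illustration (not the repository's actual code), the sketch below shows the kind of per-step latent update this implies. The get_cross_attention_maps helper is hypothetical and stands in for running the UNet and aggregating its cross-attention layers; the loss pushes the most-neglected subject token toward a higher maximum attention value.

```python
import torch

def attend_and_excite_step(latents, get_cross_attention_maps, token_indices, step_size=0.1):
    """One Generative Semantic Nursing update (illustrative sketch only).

    latents: current noised latents, e.g. shape (1, 4, 64, 64) for Stable Diffusion.
    get_cross_attention_maps: hypothetical differentiable callable that runs the UNet
        on `latents` and returns cross-attention maps of shape (pixels, num_tokens).
    token_indices: prompt positions of the subject tokens to "excite".
    """
    latents = latents.clone().detach().requires_grad_(True)
    attn = get_cross_attention_maps(latents)  # (pixels, num_tokens), softmaxed over tokens
    # The most-neglected subject token should reach a high maximum attention value.
    max_attn_per_token = torch.stack([attn[:, i].max() for i in token_indices])
    loss = (1.0 - max_attn_per_token).max()
    # Shift the latent code in the direction that strengthens that token's activations.
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()
```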

Quick Start & Requirements

  • Install: Follow the setup instructions in the environment/environment.yaml file for the base Stable Diffusion environment, then install additional requirements from environment/requirements.txt.
  • Prerequisites: Requires the official Stable Diffusion repository setup and Hugging Face's Diffusers library.
  • Usage: Run python run.py --prompt "a cat and a dog" --seeds [0] --token_indices [2,5] (a diffusers-based sketch follows this list).
  • Resources: Supports torch.float16 for reduced memory usage and faster inference.
  • Docs: Notebooks for generation and explainability are provided.
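The snippet below is a minimal usage sketch based on the Attend-and-Excite pipeline bundled with recent versions of Hugging Face diffusers, not the repository's own run.py; the model ID, token indices, and torch.float16 settings are illustrative and may need adjusting for your setup.

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

# Half precision reduces memory use and speeds up inference,
# at the cost of a slight quality drop in some cases (see Limitations & Caveats).
pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a cat and a dog",
    token_indices=[2, 5],  # positions of "cat" and "dog" in the tokenized prompt
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("cat_and_dog.png")
```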

Highlighted Details

  • Enhances semantic faithfulness in text-to-image generation by addressing catastrophic neglect and attribute binding.
  • Utilizes an attention-based mechanism to guide cross-attention units during synthesis.
  • Includes code for reproducing quantitative experiments, such as CLIP similarity metrics (an illustrative CLIP similarity sketch follows this list).
  • Offers notebooks for generating images and visualizing cross-attention maps for explainability.
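For the quantitative side, the following is an illustrative sketch of a CLIP image-text similarity score using Hugging Face transformers; the repository ships its own evaluation scripts, and the exact metric and CLIP checkpoint it uses may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Cosine similarity between CLIP image and text embeddings (sketch only).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def clip_similarity(image_path: str, text: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

print(clip_similarity("cat_and_dog.png", "a cat and a dog"))
```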

Maintenance & Community

The project is associated with Tel Aviv University. Citation details are provided for academic use.

Licensing & Compatibility

The repository builds upon the diffusers library and the Prompt-to-Prompt codebase. Specific licensing details beyond this are not explicitly stated in the README.

Limitations & Caveats

Using torch.float16 may slightly degrade results in some cases. The method is an inference-time enhancement to existing pre-trained diffusion models such as Stable Diffusion, not a standalone model.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 12 stars in the last 90 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 28 more.

stable-diffusion by CompVis

  Latent text-to-image diffusion model
  • Top 0.1% on sourcepulse
  • 71k stars
  • Created 3 years ago, updated 1 year ago