latent-diffusion by CompVis

Image synthesis research paper using latent diffusion models

Created 3 years ago
13,316 stars

Top 3.7% on SourcePulse

Project Summary

This repository provides the official implementation for Latent Diffusion Models (LDMs), a class of generative models capable of high-resolution image synthesis. It targets researchers and practitioners in computer vision and deep learning interested in state-of-the-art image generation, offering pre-trained models and code for various tasks including text-to-image, inpainting, and retrieval-augmented generation.

How It Works

LDMs operate by performing diffusion in a lower-dimensional latent space learned by an autoencoder. This approach significantly reduces computational cost compared to diffusion in pixel space, enabling high-resolution synthesis with greater efficiency. The models leverage a U-Net architecture for the diffusion process and can be conditioned on various inputs like text embeddings or retrieved image features, allowing for controllable and context-aware generation.
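The core idea can be sketched in a few lines of NumPy. Everything below is a toy stand-in (the real autoencoder and U-Net are learned networks, and the shapes are hypothetical); only the DDPM forward-noising formula and the f=8 downsampling factor mirror the approach described above. In the real model, the U-Net predicts the noise `eps` so the process can be run in reverse to sample new latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    # Stand-in for the pretrained autoencoder's encoder:
    # average-pool a 64x64 image down to an 8x8 latent (f=8).
    f = 8
    h, w = image.shape
    return image.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(latent):
    # Stand-in decoder: nearest-neighbour upsample back to pixel space.
    return np.kron(latent, np.ones((8, 8)))

def forward_diffuse(z0, t, alphas_cumprod):
    # DDPM forward process applied in latent space:
    # z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps
    eps = rng.standard_normal(z0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps, eps

# Standard linear noise schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

image = rng.standard_normal((64, 64))
z0 = encode(image)  # 8x8 latent: 64x fewer values than pixel space
zt, eps = forward_diffuse(z0, t=500, alphas_cumprod=alphas_cumprod)

print(z0.shape, zt.shape, decode(z0).shape)
```

The efficiency gain is visible in the shapes alone: the diffusion loop touches an 8×8 latent instead of a 64×64 image, and the expensive decode happens once at the end rather than at every step.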

Quick Start & Requirements

  • Install: Create and activate a conda environment using conda env create -f environment.yaml and conda activate ldm.
  • Prerequisites: PyTorch, transformers, scann, kornia, torchmetrics, einops. Specific versions are noted for retrieval-augmented models.
  • Models: Pre-trained models for various tasks (text-to-image, inpainting, etc.) and datasets (ImageNet, LSUN, CelebA-HQ) are available for download via provided links and scripts (scripts/download_models.sh, scripts/download_first_stages.sh).
  • Demo: A web demo using Huggingface Spaces is available.
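A typical session might look like the following; the download scripts are named in the README, and the `txt2img.py` flags follow the README's example invocation (they may differ across versions of the repo):

```shell
# Create and activate the environment
conda env create -f environment.yaml
conda activate ldm

# Fetch pre-trained diffusion models and first-stage autoencoders
bash scripts/download_models.sh
bash scripts/download_first_stages.sh

# Sample from the text-to-image model (example prompt and flags from the README)
python scripts/txt2img.py \
    --prompt "a virus monster is playing guitar, oil on canvas" \
    --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50
```

Sampling requires a CUDA-capable GPU; `--scale` controls the classifier-free guidance strength and `--ddim_steps` trades sample quality against speed.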

Highlighted Details

  • Supports text-conditional image synthesis with a 1.45B parameter model trained on LAION-400M.
  • Achieves an FID of 3.6 on class-conditional ImageNet with classifier-free guidance.
  • Includes code for Retrieval-Augmented Diffusion Models (RDMs) for enhanced control and retrieval-based sampling.
  • Offers pre-trained autoencoders with varying latent space dimensions (f=4, 8, 16, 32) and regularization (VQ, KL), with reported rFID scores.

Maintenance & Community

The project is associated with the Ommer Lab at Heidelberg University. Key contributors include Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. The codebase builds upon OpenAI's ADM and lucidrains' denoising-diffusion-pytorch and x-transformers.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the underlying components and the nature of the research suggest a focus on academic and research use. Commercial use would require careful review of any associated licenses for dependencies and pre-trained models.

Limitations & Caveats

The README mentions that for resolutions beyond 256x256, controllability is reduced. Some retrieval databases (e.g., OpenImages) are large (11GB+) and may require significant disk space and processing time for index creation. The ArtBench databases are noted as less effective for detailed text control.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 126 stars in the last 30 days

Explore Similar Projects

Starred by Robin Rombach (Cofounder of Black Forest Labs), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

Kandinsky-2 by ai-forever

0.0% · 3k stars
Multilingual text-to-image latent diffusion model
Created 2 years ago · Updated 1 year ago
Starred by Robin Huang (Cofounder of Comfy Org), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 17 more.

stablediffusion by Stability-AI

0.1% · 42k stars
Latent diffusion model for high-resolution image synthesis
Created 2 years ago · Updated 2 months ago