latent-diffusion by CompVis

Image synthesis research paper using latent diffusion models

Created 3 years ago
13,316 stars

Top 3.7% on SourcePulse

Project Summary

This repository provides the official implementation for Latent Diffusion Models (LDMs), a class of generative models capable of high-resolution image synthesis. It targets researchers and practitioners in computer vision and deep learning interested in state-of-the-art image generation, offering pre-trained models and code for various tasks including text-to-image, inpainting, and retrieval-augmented generation.

How It Works

LDMs operate by performing diffusion in a lower-dimensional latent space learned by an autoencoder. This approach significantly reduces computational cost compared to diffusion in pixel space, enabling high-resolution synthesis with greater efficiency. The models leverage a U-Net architecture for the diffusion process and can be conditioned on various inputs like text embeddings or retrieved image features, allowing for controllable and context-aware generation.
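The core idea can be sketched in a few lines of NumPy. Everything below is a toy stand-in (the real autoencoder and U-Net are learned networks, and the shapes are hypothetical); only the DDPM forward-noising formula and the f=8 downsampling factor mirror the approach described above. In the real model, the U-Net predicts the noise `eps` so the process can be run in reverse to sample new latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    # Stand-in for the pretrained autoencoder's encoder:
    # average-pool a 64x64 image down to an 8x8 latent (f=8).
    f = 8
    h, w = image.shape
    return image.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(latent):
    # Stand-in decoder: nearest-neighbour upsample back to pixel space.
    return np.kron(latent, np.ones((8, 8)))

def forward_diffuse(z0, t, alphas_cumprod):
    # DDPM forward process applied in latent space:
    # z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps
    eps = rng.standard_normal(z0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps, eps

# Standard linear noise schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

image = rng.standard_normal((64, 64))
z0 = encode(image)  # 8x8 latent: 64x fewer values than pixel space
zt, eps = forward_diffuse(z0, t=500, alphas_cumprod=alphas_cumprod)

print(z0.shape, zt.shape, decode(z0).shape)
```

The efficiency gain is visible in the shapes alone: the diffusion loop touches an 8×8 latent instead of a 64×64 image, and the expensive decode happens once at the end rather than at every step.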

Quick Start & Requirements

  • Install: Create and activate a conda environment using conda env create -f environment.yaml and conda activate ldm.
  • Prerequisites: PyTorch, transformers, scann, kornia, torchmetrics, einops. Specific versions are noted for retrieval-augmented models.
  • Models: Pre-trained models for various tasks (text-to-image, inpainting, etc.) and datasets (ImageNet, LSUN, CelebA-HQ) are available for download via provided links and scripts (scripts/download_models.sh, scripts/download_first_stages.sh).
  • Demo: A web demo using Huggingface Spaces is available.
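A typical session might look like the following; the download scripts are named in the README, and the `txt2img.py` flags follow the README's example invocation (they may differ across versions of the repo):

```shell
# Create and activate the environment
conda env create -f environment.yaml
conda activate ldm

# Fetch pre-trained diffusion models and first-stage autoencoders
bash scripts/download_models.sh
bash scripts/download_first_stages.sh

# Sample from the text-to-image model (example prompt and flags from the README)
python scripts/txt2img.py \
    --prompt "a virus monster is playing guitar, oil on canvas" \
    --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50
```

Sampling requires a CUDA-capable GPU; `--scale` controls the classifier-free guidance strength and `--ddim_steps` trades sample quality against speed.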

Highlighted Details

  • Supports text-conditional image synthesis with a 1.45B parameter model trained on LAION-400M.
  • Achieves an FID of 3.6 on class-conditional ImageNet with classifier-free guidance.
  • Includes code for Retrieval-Augmented Diffusion Models (RDMs) for enhanced control and retrieval-based sampling.
  • Offers pre-trained autoencoders with varying latent space dimensions (f=4, 8, 16, 32) and regularization (VQ, KL), with reported rFID scores.

Maintenance & Community

The project is associated with the Ommer Lab at Heidelberg University. Key contributors include Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. The codebase builds upon OpenAI's ADM and lucidrains' denoising-diffusion-pytorch and x-transformers.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the underlying components and the nature of the research suggest a focus on academic and research use. Commercial use would require careful review of any associated licenses for dependencies and pre-trained models.

Limitations & Caveats

The README mentions that for resolutions beyond 256x256, controllability is reduced. Some retrieval databases (e.g., OpenImages) are large (11GB+) and may require significant disk space and processing time for index creation. The ArtBench databases are noted as less effective for detailed text control.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 126 stars in the last 30 days

Explore Similar Projects

Starred by Robin Rombach (Cofounder of Black Forest Labs), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

Kandinsky-2 by ai-forever

0.0% · 3k stars
Multilingual text-to-image latent diffusion model
Created 2 years ago · Updated 1 year ago
Starred by Robin Huang (Cofounder of Comfy Org), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 17 more.

stablediffusion by Stability-AI

0.1% · 42k stars
Latent diffusion model for high-resolution image synthesis
Created 2 years ago · Updated 2 months ago