taming-transformers by CompVis

Image synthesis research paper using transformers

created 4 years ago
6,271 stars

Top 8.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation for "Taming Transformers for High-Resolution Image Synthesis," enabling efficient and expressive image generation by combining convolutional VQGANs with autoregressive transformers. It's targeted at researchers and practitioners in computer vision and generative modeling looking to achieve state-of-the-art results in high-resolution image synthesis.

How It Works

The core approach uses a VQGAN (Vector Quantized Generative Adversarial Network) to learn a codebook of visual parts, effectively compressing images into discrete tokens. An autoregressive transformer then models the composition of these tokens, allowing for high-resolution synthesis. This hybrid approach leverages the efficiency of convolutions for local feature extraction and the global context modeling power of transformers.
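
As a concrete illustration, the stage-one quantization can be sketched in a few lines of PyTorch (a simplified sketch with invented names and sizes, not the repo's actual API): encoder features are snapped to their nearest codebook entries, and the resulting index grid becomes the token sequence the transformer models.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Maps continuous encoder features to discrete codebook indices."""
        def __init__(self, num_codes=1024, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):
            # z: (B, C, H, W) features from the convolutional encoder
            b, c, h, w = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (B*H*W, C)
            dists = torch.cdist(flat, self.codebook.weight)  # distance to every code
            indices = dists.argmin(dim=1)                    # nearest code per position
            z_q = self.codebook(indices).view(b, h, w, c).permute(0, 3, 1, 2)
            z_q = z + (z_q - z).detach()  # straight-through gradient estimator
            return z_q, indices.view(b, h * w)

The flattened (B, H*W) index grid is then modeled with ordinary next-token prediction, exactly as in language modeling, and the VQGAN decoder maps sampled indices back to pixels.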

Quick Start & Requirements

  • Install: Create and activate a conda environment using conda env create -f environment.yaml and conda activate taming.
  • Prerequisites: Python, PyTorch. Specific dataset preparation steps are detailed for ImageNet, CelebA-HQ, FFHQ, COCO, and ADE20k.
  • Resources: Pretrained models are available for various datasets, including ImageNet, FFHQ, and CelebA-HQ, with reported FID scores (see the loading sketch after this list).
  • Demo: Streamlit demos are available for sampling and image completion; see the Colab quickstart notebook.
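
A minimal sketch of loading a pretrained VQGAN, following the loading pattern used in the repo's notebooks (the config and checkpoint paths below are placeholders for whichever model you downloaded):

    import torch
    from omegaconf import OmegaConf
    from taming.models.vqgan import VQModel

    # Placeholder paths: point these at the downloaded files.
    config = OmegaConf.load("logs/vqgan_imagenet_f16_1024/configs/model.yaml")
    model = VQModel(**config.model.params)
    state = torch.load("logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt",
                       map_location="cpu")["state_dict"]
    model.load_state_dict(state, strict=False)  # strict=False skips training-only keys
    model.eval()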

Highlighted Details

  • Achieves state-of-the-art FID scores among autoregressive approaches for class-conditional ImageNet synthesis.
  • Offers accelerated sampling via caching of keys/values in self-attention (see the sketch after this list).
  • Supports training on custom datasets with provided configuration files.
  • Includes models for scene image synthesis conditioned on bounding boxes.
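
On the accelerated sampling point above: during autoregressive generation, the keys and values computed for earlier tokens can be cached and reused, so each new token attends against the stored cache instead of recomputing the whole prefix. A minimal sketch of the general technique (illustrative only, with invented names; not the repo's implementation):

    import torch
    import torch.nn as nn

    class CachedSelfAttention(nn.Module):
        def __init__(self, dim, n_heads):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)
            self.cache_k = self.cache_v = None  # grown one step at a time

        def forward(self, x):
            # x: (B, 1, C) -- only the newest token; past tokens live in the cache
            b, t, c = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            shape = (b, t, self.n_heads, self.head_dim)
            q, k, v = (u.view(shape).transpose(1, 2) for u in (q, k, v))
            if self.cache_k is not None:
                k = torch.cat([self.cache_k, k], dim=2)  # reuse cached keys
                v = torch.cat([self.cache_v, v], dim=2)  # reuse cached values
            self.cache_k, self.cache_v = k, v
            att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            out = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, c)
            return self.proj(out)

Because the single-token query sees only the cache plus itself, no causal mask is needed, and the per-token cost drops from quadratic to linear in the sequence length.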

Maintenance & Community

The code accompanies the CVPR 2021 paper of the same name. README updates in 2022 announced new pretrained VQGANs for Latent Diffusion Models as well as additional scene synthesis models.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

Data preparation for datasets such as ImageNet can be time-consuming and requires significant disk space. Some features depend on specific dependency versions (e.g., MiDaS v2.0 for depth map generation). The README also mentions a bugfix for the quantizer; the fix is disabled by default for backward compatibility with previously trained models.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

144 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Travis Fischer (Founder of Agentic), and 3 more.

consistency_models by openai

0.0%
6k
PyTorch code for consistency models research paper
created 2 years ago
updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

0.2%
7k
Image synthesis codebase for diffusion models
created 4 years ago
updated 1 year ago