taming-transformers by CompVis

Image synthesis research paper using transformers

created 4 years ago
6,271 stars

Top 8.3% on sourcepulse

Project Summary

This repository provides a PyTorch implementation for "Taming Transformers for High-Resolution Image Synthesis," enabling efficient and expressive image generation by combining convolutional VQGANs with autoregressive transformers. It's targeted at researchers and practitioners in computer vision and generative modeling looking to achieve state-of-the-art results in high-resolution image synthesis.

How It Works

The core approach uses a VQGAN (Vector Quantized Generative Adversarial Network) to learn a codebook of visual parts, effectively compressing images into discrete tokens. An autoregressive transformer then models the composition of these tokens, allowing for high-resolution synthesis. This hybrid approach leverages the efficiency of convolutions for local feature extraction and the global context modeling power of transformers.
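
As a concrete illustration, the stage-one quantization can be sketched in a few lines of PyTorch (a simplified sketch with invented names and sizes, not the repo's actual API): encoder features are snapped to their nearest codebook entries, and the resulting index grid becomes the token sequence the transformer models.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Maps continuous encoder features to discrete codebook indices."""
        def __init__(self, num_codes=1024, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):
            # z: (B, C, H, W) features from the convolutional encoder
            b, c, h, w = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (B*H*W, C)
            dists = torch.cdist(flat, self.codebook.weight)  # distance to every code
            indices = dists.argmin(dim=1)                    # nearest code per position
            z_q = self.codebook(indices).view(b, h, w, c).permute(0, 3, 1, 2)
            z_q = z + (z_q - z).detach()  # straight-through gradient estimator
            return z_q, indices.view(b, h * w)

The flattened (B, H*W) index grid is then modeled with ordinary next-token prediction, exactly as in language modeling, and the VQGAN decoder maps sampled indices back to pixels.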

Quick Start & Requirements

  • Install: Create and activate a conda environment using conda env create -f environment.yaml and conda activate taming.
  • Prerequisites: Python, PyTorch. Specific dataset preparation steps are detailed for ImageNet, CelebA-HQ, FFHQ, COCO, and ADE20k.
  • Resources: Pretrained models are available for various datasets, including ImageNet, FFHQ, and CelebA-HQ, with reported FID scores (see the loading sketch after this list).
  • Demo: Streamlit demos are available for sampling and image completion; see the Colab quickstart notebook.
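
A minimal sketch of loading a pretrained VQGAN, following the loading pattern used in the repo's notebooks (the config and checkpoint paths below are placeholders for whichever model you downloaded):

    import torch
    from omegaconf import OmegaConf
    from taming.models.vqgan import VQModel

    # Placeholder paths: point these at the downloaded files.
    config = OmegaConf.load("logs/vqgan_imagenet_f16_1024/configs/model.yaml")
    model = VQModel(**config.model.params)
    state = torch.load("logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt",
                       map_location="cpu")["state_dict"]
    model.load_state_dict(state, strict=False)  # strict=False skips training-only keys
    model.eval()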

Highlighted Details

  • Achieves state-of-the-art FID scores among autoregressive approaches for class-conditional ImageNet synthesis.
  • Offers accelerated sampling via caching of keys/values in self-attention (see the sketch after this list).
  • Supports training on custom datasets with provided configuration files.
  • Includes models for scene image synthesis conditioned on bounding boxes.
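
On the accelerated sampling point above: during autoregressive generation, the keys and values computed for earlier tokens can be cached and reused, so each new token attends against the stored cache instead of recomputing the whole prefix. A minimal sketch of the general technique (illustrative only, with invented names; not the repo's implementation):

    import torch
    import torch.nn as nn

    class CachedSelfAttention(nn.Module):
        def __init__(self, dim, n_heads):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)
            self.cache_k = self.cache_v = None  # grown one step at a time

        def forward(self, x):
            # x: (B, 1, C) -- only the newest token; past tokens live in the cache
            b, t, c = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            shape = (b, t, self.n_heads, self.head_dim)
            q, k, v = (u.view(shape).transpose(1, 2) for u in (q, k, v))
            if self.cache_k is not None:
                k = torch.cat([self.cache_k, k], dim=2)  # reuse cached keys
                v = torch.cat([self.cache_v, v], dim=2)  # reuse cached values
            self.cache_k, self.cache_v = k, v
            att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
            out = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, c)
            return self.proj(out)

Because the single-token query sees only the cache plus itself, no causal mask is needed, and the per-token cost drops from quadratic to linear in the sequence length.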

Maintenance & Community

The code accompanies the CVPR 2021 paper of the same name. README updates in 2022 announced new pretrained VQGANs for Latent Diffusion Models as well as additional scene synthesis models.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

Data preparation for datasets such as ImageNet can be time-consuming and requires significant disk space. Some features depend on specific dependency versions (e.g., MiDaS v2.0 for depth map generation). The README also mentions a bugfix for the quantizer; the fix is disabled by default for backward compatibility with previously trained models.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

144 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Travis Fischer (Founder of Agentic), and 3 more.

consistency_models by openai

0.0%
6k
PyTorch code for consistency models research paper
created 2 years ago
updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 3 more.

guided-diffusion by openai

0.2%
7k
Image synthesis codebase for diffusion models
created 4 years ago
updated 1 year ago