RAE by bytetriper

High-fidelity image synthesis via Representation Autoencoders

Created 2 months ago
1,598 stars

Top 26.1% on SourcePulse

Project Summary

This repository provides the official PyTorch implementation of "Diffusion Transformers with Representation Autoencoders" (RAE), a novel approach to high-fidelity image synthesis. Aimed at researchers and engineers, it generates high-quality images through a two-stage pipeline that pairs pre-trained, frozen representation encoders with trainable Vision Transformer decoders. By building on established visual representations rather than learning a latent space from scratch, it achieves efficient, high-fidelity generation.

How It Works

RAE introduces autoencoders that pair frozen, pre-trained representation encoders (e.g., DINOv2, SigLIP2) with trainable Vision Transformer (ViT) decoders. This design lets the model exploit powerful, general-purpose visual features without training an encoder from scratch. Training proceeds in two stages: first, the Representation Autoencoder is trained to reconstruct images from the frozen encoder's features; second, a diffusion model (such as DiT) is trained in the latent space defined by the RAE, yielding high-fidelity image synthesis.

Quick Start & Requirements

  • Requirements: Python 3.10 and Conda. Key dependencies: PyTorch 2.2.0 with CUDA 12.1, timm==0.9.16, accelerate==0.23.0, torchdiffeq==0.2.5, wandb, numpy<2, transformers, einops, and omegaconf.
  • Setup: create a Conda environment and install packages with uv.
  • Pre-trained models (RAE decoders, DiT DH): download with hf download nyu-visionx/RAE-collections --local-dir models.
  • Data: prepare the ImageNet-1k dataset and point the training/sampling scripts to its location via the --data-path argument.
  • Configuration: managed via OmegaConf YAML files.
  • Training: launched with torchrun using src/train_stage1.py (stage 1) and src/train.py (stage 2).
  • Sampling: src/sample.py, or src/sample_ddp.py for distributed sampling.
  • TPU support is available via an XLA branch.
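Put together, the documented setup corresponds to a command sequence along the following lines. This is a sketch, not the repository's own script: the environment name, GPU count, and --config paths are illustrative assumptions, and only the dependency versions, the hf download command, the --data-path flag, and the script names come from the summary above.

```shell
# Environment: Python 3.10 via Conda; packages installed with uv (per the README)
conda create -n rae python=3.10 -y
conda activate rae
pip install uv
uv pip install torch==2.2.0 timm==0.9.16 accelerate==0.23.0 torchdiffeq==0.2.5 \
    wandb "numpy<2" transformers einops omegaconf

# Pre-trained RAE decoders and DiT DH checkpoints
hf download nyu-visionx/RAE-collections --local-dir models

# Stage 1: train the representation autoencoder (encoder stays frozen).
# Config file names are placeholders; the actual OmegaConf YAMLs live in the repo.
torchrun --nproc_per_node=8 src/train_stage1.py \
    --config configs/stage1.yaml --data-path /path/to/imagenet

# Stage 2: train the diffusion model on RAE latents
torchrun --nproc_per_node=8 src/train.py \
    --config configs/stage2.yaml --data-path /path/to/imagenet

# Sampling: single-process, or distributed via DDP
python src/sample.py --config configs/stage2.yaml
torchrun --nproc_per_node=8 src/sample_ddp.py --config configs/stage2.yaml
```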

Highlighted Details

  • Official PyTorch and TorchXLA/TPU implementations for RAE and Diffusion Transformers (DiT).
  • Leverages frozen, pre-trained encoders (DINOv2, SigLIP2) within the RAE framework.
  • Features a two-stage pipeline for high-fidelity image synthesis.
  • Includes implementations for RAE, LightningDiT, and DiT DH models.
  • Configuration-driven approach using OmegaConf YAML files for flexibility.
  • Supports distributed training (PyTorch DDP) and sampling for scalability.
  • Provides scripts for training, sampling, reconstruction, and evaluation.

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The license type and any compatibility notes for commercial or closed-source use are not explicitly stated in the provided README content.

Limitations & Caveats

TPU support lives in a separate "XLA branch," which may require a separate checkout and could be experimental. The setup pins PyTorch 2.2.0 and CUDA 12.1, which may limit compatibility with other environments. The ImageNet-1k dataset must be prepared before training or evaluation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 16
  • Star History: 170 stars in the last 30 days

Starred by Tobi Lutke (Cofounder of Shopify), Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), and 3 more.

Explore Similar Projects

taesd by madebyollin

  • Tiny AutoEncoder for Stable Diffusion latents
  • Top 0.4% on SourcePulse; 821 stars
  • Created 2 years ago; updated 3 days ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 12 more.

IF by deep-floyd

  • Text-to-image model for photorealistic synthesis and language understanding
  • Top 0.0% on SourcePulse; 8k stars
  • Created 2 years ago; updated 1 year ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 15 more.

taming-transformers by CompVis

  • Transformer-based image synthesis (official research code)
  • Top 0.2% on SourcePulse; 6k stars
  • Created 5 years ago; updated 1 year ago