RAE by bytetriper

High-fidelity image synthesis via Representation Autoencoders

Created 2 months ago
1,598 stars

Top 26.1% on SourcePulse

Project Summary

This repository provides the official PyTorch implementation of "Diffusion Transformers with Representation Autoencoders" (RAE), a novel approach to high-fidelity image synthesis. Aimed at researchers and engineers, it generates high-quality images through a two-stage pipeline that pairs pre-trained, frozen representation encoders with trainable Vision Transformer decoders. By building on established visual representations rather than learning a latent space from scratch, it achieves efficient, high-fidelity generation.

How It Works

RAE introduces autoencoders that pair frozen, pre-trained representation encoders (e.g., DINOv2, SigLIP2) with trainable Vision Transformer (ViT) decoders. This design lets the model exploit powerful, general-purpose visual features without training an encoder from scratch. Training proceeds in two stages: first, the Representation Autoencoder is trained to reconstruct images from the frozen encoder's features; second, a diffusion model (such as DiT) is trained in the latent space defined by the RAE, yielding high-fidelity image synthesis.

Quick Start & Requirements

  • Requirements: Python 3.10 and Conda. Key dependencies: PyTorch 2.2.0 with CUDA 12.1, timm==0.9.16, accelerate==0.23.0, torchdiffeq==0.2.5, wandb, numpy<2, transformers, einops, and omegaconf.
  • Setup: create a Conda environment and install packages with uv.
  • Pre-trained models (RAE decoders, DiT DH): download with hf download nyu-visionx/RAE-collections --local-dir models.
  • Data: prepare the ImageNet-1k dataset and point the training/sampling scripts to its location via the --data-path argument.
  • Configuration: managed via OmegaConf YAML files.
  • Training: launched with torchrun using src/train_stage1.py (stage 1) and src/train.py (stage 2).
  • Sampling: src/sample.py, or src/sample_ddp.py for distributed sampling.
  • TPU support is available via an XLA branch.
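Put together, the documented setup corresponds to a command sequence along the following lines. This is a sketch, not the repository's own script: the environment name, GPU count, and --config paths are illustrative assumptions, and only the dependency versions, the hf download command, the --data-path flag, and the script names come from the summary above.

```shell
# Environment: Python 3.10 via Conda; packages installed with uv (per the README)
conda create -n rae python=3.10 -y
conda activate rae
pip install uv
uv pip install torch==2.2.0 timm==0.9.16 accelerate==0.23.0 torchdiffeq==0.2.5 \
    wandb "numpy<2" transformers einops omegaconf

# Pre-trained RAE decoders and DiT DH checkpoints
hf download nyu-visionx/RAE-collections --local-dir models

# Stage 1: train the representation autoencoder (encoder stays frozen).
# Config file names are placeholders; the actual OmegaConf YAMLs live in the repo.
torchrun --nproc_per_node=8 src/train_stage1.py \
    --config configs/stage1.yaml --data-path /path/to/imagenet

# Stage 2: train the diffusion model on RAE latents
torchrun --nproc_per_node=8 src/train.py \
    --config configs/stage2.yaml --data-path /path/to/imagenet

# Sampling: single-process, or distributed via DDP
python src/sample.py --config configs/stage2.yaml
torchrun --nproc_per_node=8 src/sample_ddp.py --config configs/stage2.yaml
```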

Highlighted Details

  • Official PyTorch and TorchXLA/TPU implementations for RAE and Diffusion Transformers (DiT).
  • Leverages frozen, pre-trained encoders (DINOv2, SigLIP2) within the RAE framework.
  • Features a two-stage pipeline for high-fidelity image synthesis.
  • Includes implementations for RAE, LightningDiT, and DiT DH models.
  • Configuration-driven approach using OmegaConf YAML files for flexibility.
  • Supports distributed training (PyTorch DDP) and sampling for scalability.
  • Provides scripts for training, sampling, reconstruction, and evaluation.

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The license type and any compatibility notes for commercial or closed-source use are not explicitly stated in the provided README content.

Limitations & Caveats

TPU support lives in a separate "XLA branch," which may require a separate checkout and could be experimental. The setup pins PyTorch 2.2.0 and CUDA 12.1, which may limit compatibility with other environments. The ImageNet-1k dataset must be prepared before training or evaluation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 16
  • Star History: 170 stars in the last 30 days

Starred by Tobi Lutke (Cofounder of Shopify), Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), and 3 more.

Explore Similar Projects

taesd by madebyollin

  • Tiny AutoEncoder for Stable Diffusion latents
  • Top 0.4% on SourcePulse; 821 stars
  • Created 2 years ago; updated 3 days ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Chaoyu Yang (Founder of Bento), and 12 more.

IF by deep-floyd

  • Text-to-image model for photorealistic synthesis and language understanding
  • Top 0.0% on SourcePulse; 8k stars
  • Created 2 years ago; updated 1 year ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 15 more.

taming-transformers by CompVis

  • Transformer-based image synthesis (official research code)
  • Top 0.2% on SourcePulse; 6k stars
  • Created 5 years ago; updated 1 year ago