bytetriper
High-fidelity image synthesis via Representation Autoencoders
Top 26.1% on SourcePulse
Summary
This repository provides the official PyTorch implementation of "Diffusion Transformers with Representation Autoencoders" (RAE), an approach to high-fidelity image synthesis. Aimed at researchers and engineers, it generates high-quality images through a two-stage pipeline that pairs frozen, pre-trained representation encoders with trainable Vision Transformer decoders, gaining efficiency by building on established visual representations rather than learning them from scratch.
How It Works
RAE pairs frozen, pre-trained representation encoders (e.g., DINOv2, SigLIP2) with trainable Vision Transformer (ViT) decoders to form an autoencoder. This lets the model exploit powerful, general-purpose visual features without training an encoder from scratch. Training proceeds in two stages: first the Representation Autoencoder is trained, then a diffusion model (such as DiT) is trained in the latent space produced by the frozen RAE.
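As a hedged illustration, the two stages map onto the repository's training scripts named in the Quick Start section below. Only the script paths and the --data-path argument come from the project's documentation; the GPU count, the --config flag, and the YAML file names are illustrative assumptions.

    # Stage 1: train the Representation Autoencoder
    # (frozen DINOv2/SigLIP2 encoder + trainable ViT decoder).
    # The --config flag and YAML file name are assumptions.
    torchrun --nproc_per_node=8 src/train_stage1.py \
        --config configs/stage1_rae.yaml \
        --data-path /path/to/imagenet

    # Stage 2: train the diffusion transformer (DiT) in the latent
    # space produced by the frozen stage-1 RAE.
    torchrun --nproc_per_node=8 src/train.py \
        --config configs/stage2_dit.yaml \
        --data-path /path/to/imagenet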
Quick Start & Requirements
Installation requires Python 3.10 and Conda. Key dependencies: PyTorch 2.2.0 with CUDA 12.1, timm==0.9.16, accelerate==0.23.0, torchdiffeq==0.2.5, wandb, numpy<2, transformers, einops, and omegaconf. Setup then proceeds in a few steps (a sketch follows this list):
- Create a Conda environment and install the packages with uv.
- Download the pre-trained models (RAE decoders, DiT DH): hf download nyu-visionx/RAE-collections --local-dir models
- Prepare the ImageNet-1k dataset and point the training and sampling scripts to it via the --data-path argument.
- Manage configuration through OmegaConf YAML files.
- Launch training with torchrun (src/train_stage1.py for stage 1, src/train.py for stage 2) and sample with src/sample.py or src/sample_ddp.py.
TPU support is available via an XLA branch.
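As referenced above, a minimal end-to-end sketch of the setup. The hf download command is quoted from the project; the environment name "rae", the PyTorch wheel index, the process count, and the --config flag and file names are assumptions, so consult the repository's README for the authoritative commands.

    # Create and activate the Conda environment (name "rae" is an assumption).
    conda create -n rae python=3.10 -y
    conda activate rae

    # Install PyTorch 2.2.0 built against CUDA 12.1, then the pinned
    # dependencies, using uv.
    uv pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cu121
    uv pip install timm==0.9.16 accelerate==0.23.0 torchdiffeq==0.2.5 \
        wandb "numpy<2" transformers einops omegaconf

    # Fetch the pre-trained RAE decoders and DiT checkpoints.
    hf download nyu-visionx/RAE-collections --local-dir models

    # Single-process sampling; use src/sample_ddp.py via torchrun for
    # multi-GPU sampling. Config paths here are illustrative assumptions.
    python src/sample.py --config configs/sample.yaml --data-path /path/to/imagenet
    torchrun --nproc_per_node=8 src/sample_ddp.py \
        --config configs/sample.yaml \
        --data-path /path/to/imagenet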
Maintenance & Community
The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.
Licensing & Compatibility
The license type and any compatibility notes for commercial or closed-source use are not explicitly stated in the provided README content.
Limitations & Caveats
TPU support lives in a separate XLA branch, which may require a separate checkout and could be experimental. The setup pins PyTorch 2.2.0 and CUDA 12.1, which may limit compatibility with other environments. The ImageNet-1k dataset must be prepared before training or evaluation.