Image-generation research code based on latent diffusion
This project addresses the optimization dilemma in latent diffusion models (LDMs): increasing the feature dimension of the visual tokenizer improves reconstruction quality but makes the resulting latent space harder for diffusion models to learn, degrading generation performance unless model size and training cost grow substantially. It offers a solution for researchers and practitioners seeking faster, more efficient training of high-fidelity diffusion models, achieving state-of-the-art results with significantly reduced training time.
How It Works
The core innovation is the Vision foundation model Aligned Variational AutoEncoder (VA-VAE), which aligns the latent space with pre-trained vision foundation models. This approach mitigates the difficulty of learning unconstrained high-dimensional latent spaces, enabling faster convergence for diffusion transformers. The project also introduces LightningDiT, an enhanced diffusion transformer (DiT) baseline built upon VA-VAE, featuring improved training strategies and architectural designs for accelerated training and superior generation quality.
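The alignment idea can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumptions, not the repository's implementation: the function name, projection layer, and margin value are hypothetical, and the paper's full alignment objective also includes additional terms beyond the per-token similarity shown here.

# Minimal sketch of latent-space alignment with a frozen vision foundation
# model, in the spirit of VA-VAE. All names and the margin value are
# illustrative assumptions, not the repository's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def alignment_loss(latents: torch.Tensor, vfm_feats: torch.Tensor,
                   proj: nn.Module, margin: float = 0.5) -> torch.Tensor:
    """Pull VAE latents toward frozen foundation-model patch features.

    latents:   (B, N, C_z) token latents from the VAE encoder
    vfm_feats: (B, N, C_f) patch features from a frozen model (e.g. DINOv2)
    proj:      linear map from C_z to C_f
    """
    z = F.normalize(proj(latents), dim=-1)
    f = F.normalize(vfm_feats, dim=-1)
    cos = (z * f).sum(dim=-1)              # per-token cosine similarity
    # Margin-relaxed objective: only penalize tokens whose similarity falls
    # below (1 - margin), leaving the latent space some freedom.
    return F.relu(1.0 - margin - cos).mean()

# Toy usage with random tensors:
proj = nn.Linear(32, 768)                  # latent dim 32 -> DINOv2-like 768
loss = alignment_loss(torch.randn(2, 256, 32), torch.randn(2, 256, 768), proj)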
Quick Start & Requirements
conda create -n lightningdit python=3.10.12
conda activate lightningdit
pip install -r requirements.txt
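After installation, a quick sanity check (hypothetical, not part of the repository) confirms that PyTorch can see the GPU before launching training or inference:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"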
Highlighted Details
Maintenance & Community
The project is associated with hustvl and builds upon DiT, FastDiT, and SiT. Code for VA-VAE is based on LDM and MAR.
Licensing & Compatibility
Limitations & Caveats
The FID reported by the inference script is for reference only; the final FID-50K should be computed with OpenAI's guided-diffusion evaluation suite.
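A sketch of that workflow, assuming the standard guided-diffusion evaluation setup (the evaluator requires TensorFlow, and the reference-batch and sample filenames below are placeholders):

git clone https://github.com/openai/guided-diffusion.git
python guided-diffusion/evaluations/evaluator.py VIRTUAL_imagenet256_labeled.npz samples.npz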