LightningDiT by hustvl

Image generation research paper using latent diffusion

Created 8 months ago
1,176 stars

Top 33.0% on SourcePulse

View on GitHub
Project Summary

This project addresses the optimization dilemma in latent diffusion models (LDMs): increasing the visual tokenizer's feature dimension improves reconstruction quality, but the resulting higher-dimensional latent space is harder for diffusion models to learn, requiring larger models and longer training to reach comparable generation performance. It offers a solution for researchers and practitioners seeking faster, more efficient training of high-fidelity diffusion models, achieving state-of-the-art results with significantly reduced training time.

How It Works

The core innovation is the Vision foundation model Aligned Variational AutoEncoder (VA-VAE), which aligns the latent space with pre-trained vision foundation models. This approach mitigates the difficulty of learning unconstrained high-dimensional latent spaces, enabling faster convergence for diffusion transformers. The project also introduces LightningDiT, an enhanced diffusion transformer (DiT) baseline built upon VA-VAE, featuring improved training strategies and architectural designs for accelerated training and superior generation quality.
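To make the alignment idea concrete, here is a minimal, hypothetical sketch (not the repository's implementation): during VAE training, an auxiliary loss pulls the encoder's latent features toward the patch features of a frozen vision foundation model such as DINOv2. The projection layer, tensor shapes, and the plain cosine objective are simplifying assumptions; the paper's full alignment loss is more elaborate.

```python
import torch
import torch.nn.functional as F

def vf_alignment_loss(vae_latents, foundation_feats, proj):
    """Illustrative vision-foundation alignment loss (simplified sketch).

    vae_latents:      (B, C, H, W) latent map from the VAE encoder
    foundation_feats: (B, N, D) patch features from a frozen foundation model
    proj:             learnable linear layer mapping C -> D so the two
                      feature spaces are comparable
    """
    B, C, H, W = vae_latents.shape
    # Flatten the latent map into a sequence of "patch" vectors: (B, H*W, C)
    z = vae_latents.flatten(2).transpose(1, 2)
    # Match token counts by interpolating the foundation features if needed
    if foundation_feats.shape[1] != H * W:
        foundation_feats = F.interpolate(
            foundation_feats.transpose(1, 2), size=H * W, mode="linear"
        ).transpose(1, 2)
    # Project latents into the foundation model's feature dimension
    z = proj(z)  # (B, H*W, D)
    # Encourage per-token cosine similarity with the frozen features
    cos = F.cosine_similarity(z, foundation_feats, dim=-1)
    return (1.0 - cos).mean()

# Usage sketch: the alignment term is added to the usual VAE objectives.
# proj = torch.nn.Linear(latent_channels, foundation_dim)
# loss = recon_loss + kl_weight * kl_loss \
#        + vf_weight * vf_alignment_loss(latents, dino_feats, proj)
```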

Quick Start & Requirements

  • Installation: conda create -n lightningdit python=3.10.12, then conda activate lightningdit and pip install -r requirements.txt.
  • Prerequisites: Python 3.10.12, PyTorch.
  • Training: Requires 8 x H800 GPUs for ~10 hours to achieve FID 2.11 within 64 epochs.
  • Resources: Pre-trained weights and latent statistics are available for download (see the sketch after this list).
  • Links: Papers With Code, CVPR 2025 Paper, NeurIPS 2024 Paper.
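Below is a minimal, hypothetical sketch of how downloaded latent statistics are commonly used: channel-wise normalization of cached VA-VAE latents before diffusion training, and the inverse transform before decoding. The file name and tensor keys are assumptions for illustration, not the repository's actual interface.

```python
import torch

# Hypothetical path and keys -- check the repository's docs for the real ones.
stats = torch.load("latents_stats.pt")  # e.g. {"mean": (C,), "std": (C,)}
mean = stats["mean"].view(1, -1, 1, 1)
std = stats["std"].view(1, -1, 1, 1)

def normalize_latents(z):
    """Channel-wise standardization of cached latents, shape (B, C, H, W)."""
    return (z - mean) / std

def denormalize_latents(z):
    """Invert the normalization before decoding with the VA-VAE decoder."""
    return z * std + mean
```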

Highlighted Details

  • Achieves FID 1.35 on ImageNet-256, surpassing DiT.
  • Offers over 21x faster convergence compared to original DiT implementations.
  • Reaches FID 2.11 in approximately 10 hours with 8 GPUs.
  • VA-VAE selected for Oral Presentation at CVPR 2025.

Maintenance & Community

The project is maintained under the hustvl organization. LightningDiT builds upon DiT, fast-DiT, and SiT, while the VA-VAE code is based on LDM and MAR.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The FID results reported by the inference script are for reference; final FID-50k requires evaluation using OpenAI's guided-diffusion repository.
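For context, here is a hedged sketch of packaging generated samples for that evaluation, assuming the uint8 NHWC .npz convention used by guided-diffusion's reference batches; the function and variable names are illustrative assumptions.

```python
import numpy as np

def save_sample_batch(images, path="samples_50k.npz"):
    """Pack generated images into an .npz batch for FID-50k evaluation.

    images: float array in [0, 1], shape (N, 3, H, W) or (N, H, W, 3)
    """
    arr = np.asarray(images)
    if arr.shape[1] == 3:  # convert NCHW -> NHWC if needed
        arr = arr.transpose(0, 2, 3, 1)
    arr = (arr * 255.0).clip(0, 255).astype(np.uint8)
    np.savez(path, arr)  # stored under numpy's default key "arr_0"
    return path

# The resulting file is then compared against the ImageNet reference batch
# with guided-diffusion's evaluator script to obtain the final FID-50k.
```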

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 78 stars in the last 30 days

Starred by Tobi Lutke (Cofounder of Shopify), Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), and 3 more.

Explore Similar Projects

taesd by madebyollin

0.3%
779
Tiny AutoEncoder for Stable Diffusion latents
Created 2 years ago
Updated 5 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

Sana by NVlabs

0.4%
4k
Image synthesis research paper using a linear diffusion transformer
Created 11 months ago
Updated 5 days ago
Starred by Robin Rombach (Cofounder of Black Forest Labs), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

Kandinsky-2 by ai-forever

0.0%
3k
Multilingual text-to-image latent diffusion model
Created 2 years ago
Updated 1 year ago