LightningDiT by hustvl

Image generation research paper using latent diffusion

created 7 months ago
1,071 stars

Top 35.9% on sourcepulse

Project Summary

This project addresses the optimization dilemma in latent diffusion models (LDMs): increasing the tokenizer's latent dimension improves reconstruction quality, but it degrades generation performance unless substantially larger diffusion models and longer training schedules are used. The project offers a solution for researchers and practitioners seeking faster, more efficient training of high-fidelity diffusion models, achieving state-of-the-art results with significantly reduced training times.

How It Works

The core innovation is the Vision foundation model Aligned Variational AutoEncoder (VA-VAE), which aligns the latent space with pre-trained vision foundation models. This approach mitigates the difficulty of learning unconstrained high-dimensional latent spaces, enabling faster convergence for diffusion transformers. The project also introduces LightningDiT, an enhanced diffusion transformer (DiT) baseline built upon VA-VAE, featuring improved training strategies and architectural designs for accelerated training and superior generation quality.
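The alignment idea can be sketched as a simple auxiliary loss: project the VAE latents into the foundation model's feature space and encourage high cosine similarity with the frozen features. A minimal numpy sketch follows; the function name, margin parameter, and hinge form are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def cosine_alignment_loss(z_proj, f_frozen, margin=0.0):
    """Illustrative cosine alignment loss (assumption, not the repo's code).

    z_proj   : (N, D) VAE latents projected into the foundation feature space
    f_frozen : (N, D) frozen vision-foundation-model features (e.g. DINOv2)
    margin   : similarities above (1 - margin) incur no penalty
    """
    z = z_proj / np.linalg.norm(z_proj, axis=1, keepdims=True)
    f = f_frozen / np.linalg.norm(f_frozen, axis=1, keepdims=True)
    cos_sim = np.sum(z * f, axis=1)                       # per-sample cosine similarity
    per_sample = np.maximum(0.0, 1.0 - margin - cos_sim)  # hinge penalty on misalignment
    return per_sample.mean()
```

Perfectly aligned features yield zero loss, so the constraint only shapes the latent space where it disagrees with the frozen model, which is the intuition behind faster diffusion-transformer convergence.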

Quick Start & Requirements

  • Installation: conda create -n lightningdit python=3.10.12, conda activate lightningdit, pip install -r requirements.txt.
  • Prerequisites: Python 3.10.12, PyTorch.
  • Training: Requires 8 x H800 GPUs for ~10 hours (64 epochs) to reach FID 2.11.
  • Resources: Pre-trained weights and latent statistics are available for download.
  • Links: Papers With Code, CVPR 2025 Paper, NeurIPS 2024 Paper.
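The installation steps above can be collected into a single shell sequence (these mirror the documented commands; no extra flags are assumed):

```shell
# Create and activate the environment documented in the Quick Start
conda create -n lightningdit python=3.10.12
conda activate lightningdit

# Install the pinned Python dependencies
pip install -r requirements.txt
```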

Highlighted Details

  • Achieves FID 1.35 on ImageNet-256, surpassing DiT.
  • Offers over 21x faster convergence compared to original DiT implementations.
  • Reaches FID 2.11 in approximately 10 hours with 8 GPUs.
  • VA-VAE selected for Oral Presentation at CVPR 2025.

Maintenance & Community

The project is maintained under the hustvl organization and builds upon DiT, FastDiT, and SiT. The VA-VAE code is based on LDM and MAR.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The FID reported by the inference script is a reference value only; official FID-50k numbers require evaluation with OpenAI's guided-diffusion repository.
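For context, FID is the Fréchet distance between Gaussian fits of real and generated feature distributions; the caveat above means the official statistics pipeline must be used for comparable numbers. A minimal numpy sketch of the underlying formula (illustrative only, not the guided-diffusion evaluator):

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    # Tr((sigma1 sigma2)^{1/2}) computed through a symmetric PSD product
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions give a distance of zero; in practice the means and covariances come from Inception features of 50k samples, which is why small-sample script output differs from the official FID-50k.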

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 13
  • Star History: 365 stars in the last 90 days
