SFD  by yuemingPAN

Novel latent diffusion paradigm for accelerated, high-fidelity image generation

Created 2 months ago
297 stars

Top 89.4% on SourcePulse

Project Summary

Semantic-First Diffusion (SFD), introduced in the paper "Semantics Lead the Way," is a latent diffusion paradigm that harmonizes semantic and texture modeling for image generation. Existing Latent Diffusion Models (LDMs) denoise all latents synchronously. SFD instead prioritizes semantic formation explicitly, so that semantics denoise earlier and guide texture generation. The authors report state-of-the-art FID scores and significantly faster training convergence, which makes the project relevant to researchers and practitioners in generative AI who want high-quality, efficient image synthesis.

How It Works

SFD constructs composite latents by combining compact semantic representations from a pre-trained visual encoder with texture latents. It employs asynchronous denoising with separate noise schedules, allowing semantic latents to denoise first, establishing a semantic anchor. This is followed by a joint but asynchronous denoising phase where semantics lead textures, and finally, a texture completion phase. This explicit, semantics-led, coarse-to-fine generation process leverages the inherent structure of LDMs for improved quality and faster convergence.
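The three phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single global progress variable `u`, and the fixed lag `delta` between the two schedules are all assumptions made for illustration. Progress 0.0 stands for pure noise and 1.0 for a fully denoised latent.

```python
from typing import List, Tuple


def clamp01(x: float) -> float:
    """Clamp x to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def async_progress(u: float, delta: float) -> Tuple[float, float]:
    """Map global progress u in [0, 1 + delta] to (semantic, texture) progress.

    Hypothetical schedule: the texture latent trails the semantic latent
    by a fixed lag `delta`, so semantics always lead.
    """
    return clamp01(u), clamp01(u - delta)


def phase(u: float, delta: float) -> str:
    """Name the phase that global progress u falls in."""
    if u < delta:
        return "semantic anchor"         # only the semantic latent is denoised
    if u < 1.0:
        return "joint (semantics lead)"  # both denoise, texture lags by delta
    return "texture completion"          # semantics done, texture finishes


def schedule(num_steps: int, delta: float) -> List[Tuple[str, float, float]]:
    """Return (phase, semantic progress, texture progress) for each step."""
    rows = []
    for i in range(num_steps + 1):
        u = i * (1.0 + delta) / num_steps
        t_sem, t_tex = async_progress(u, delta)
        rows.append((phase(u, delta), t_sem, t_tex))
    return rows


if __name__ == "__main__":
    for name, t_sem, t_tex in schedule(num_steps=5, delta=0.25):
        print(f"{name:24s} semantic={t_sem:.2f} texture={t_tex:.2f}")
```

In this sketch, setting `delta` to 0 recovers synchronous denoising, while a larger `delta` gives the semantic latent a longer head start before texture denoising begins.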

Quick Start & Requirements

Highlighted Details

  • Achieves state-of-the-art FID scores, including 1.04 on ImageNet 256x256 with the 1.0B LightningDiT-XXL model.
  • Accelerates training convergence by approximately 100x relative to DiT and 33.3x relative to LightningDiT.
  • Improves upon existing methods like ReDi and VA-VAE through its asynchronous, semantics-led modeling approach.
  • Offers enhanced performance with an "AutoGuidance" module, achieving FID scores as low as 1.03 (SFD-XL, 800 epochs) and 1.04 (SFD-XXL, 800 epochs).

Maintenance & Community

The project's code is based on the LightningDiT, REPA, and ADM repositories. The README mentions no community channels (e.g., Discord, Slack), no roadmap, and no dedicated maintenance team beyond the listed authors.

Licensing & Compatibility

The README does not specify a license. As is common with research publications, it is likely intended for non-commercial, research-only use. Compatibility for commercial applications or linking with closed-source projects is not addressed.

Limitations & Caveats

The training code for the Semantic VAE and the main SFD diffusion model is listed as a to-do item and has not yet been released. The reported results were obtained primarily on 16 NPUs, and minor discrepancies are noted on A100 GPUs, which suggests hardware-specific tuning or precision differences.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 5 stars in the last 30 days
