k-diffusion by crowsonkb

PyTorch implementation of Karras et al. (2022) diffusion models

Created 3 years ago
2,510 stars

Top 18.6% on SourcePulse

Project Summary

This repository provides an implementation of the Karras et al. (2022) diffusion models for PyTorch, targeting researchers and practitioners in generative AI. It offers enhanced sampling algorithms, transformer-based diffusion models, and utilities for training and evaluation, aiming to improve sample quality and training efficiency.

How It Works

The core of k-diffusion is its implementation of diffusion models, including the Karras et al. (2022) paper's techniques. It introduces a novel image_transformer_v2 model type, inspired by Hourglass Transformer and DiT, which utilizes hierarchical transformers. This architecture employs efficient attention mechanisms like neighborhood attention (via NATTEN) and global attention (via FlashAttention-2) at different levels of the hierarchy, allowing for a flexible trade-off between performance and custom CUDA kernel requirements. It also incorporates soft Min-SNR loss weighting for improved high-resolution training.
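To illustrate, soft Min-SNR weighting can be read as a smooth minimum between the signal-to-noise ratio and a clipping constant gamma. A minimal sketch of that idea; the exact formula, the default gamma, and the function names here are assumptions for illustration, not code from the repository:

```python
def snr(sigma, sigma_data=1.0):
    # Signal-to-noise ratio at noise level sigma.
    return sigma_data ** 2 / sigma ** 2

def soft_min_snr_weight(sigma, gamma=5.0, sigma_data=1.0):
    # Smooth stand-in for min(SNR, gamma): a harmonic-style soft minimum
    # that tracks SNR when SNR << gamma and saturates near gamma when SNR >> gamma.
    s = snr(sigma, sigma_data)
    return s * gamma / (s + gamma)
```

At low sigma (high SNR) the weight saturates near gamma, damping the loss on nearly clean inputs; at high sigma it follows the SNR itself.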

Quick Start & Requirements

  • Install library code: pip install k-diffusion
  • For training/inference scripts: pip install -e <path to repository>
  • Custom CUDA kernels (NATTEN, FlashAttention-2) are recommended for optimal performance with image_transformer_v2.
  • PyTorch installation should support torch.compile().
  • Training example: python train.py --config configs/config_oxford_flowers_shifted_window.json --name flowers_demo_001 --evaluate-n 0 --batch-size 32 --sample-n 36 --mixed-precision bf16
  • Official docs: none linked; usage is driven by the scripts in the repository.
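Once installed, the library's samplers consume a decreasing noise schedule. A pure-Python sketch of the Karras et al. (2022) schedule, mirroring what k_diffusion.sampling.get_sigmas_karras computes (the parameter defaults here are assumptions):

```python
def get_sigmas_karras(n, sigma_min, sigma_max, rho=7.0):
    # Interpolate linearly in sigma^(1/rho) space (Karras et al. 2022, eq. 5),
    # then append 0.0 so the final step lands on a clean sample.
    min_inv = sigma_min ** (1 / rho)
    max_inv = sigma_max ** (1 / rho)
    ramp = [i / (n - 1) for i in range(n)]
    return [(max_inv + t * (min_inv - max_inv)) ** rho for t in ramp] + [0.0]
```

Interpolating in sigma^(1/rho) space concentrates steps near low noise levels, where most of the fine detail is resolved.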

Highlighted Details

  • Implements DPM-Solver, DPM-Solver++(2S), and DPM-Solver++(2M) samplers for improved sample quality, with adaptive step-size control.
  • Supports CLIP-guided sampling from unconditional models.
  • Includes wrappers for v-diffusion-pytorch, OpenAI diffusion, and CompVis diffusion models.
  • Enables log likelihood calculation and FID/KID metrics during training.
  • Supports multi-GPU and multi-node training via Hugging Face Accelerate.
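To make the sampler list concrete, here is a pure-Python paraphrase of the DPM-Solver++(2M) update, a second-order multistep rule in log-sigma time. This is a sketch of the idea behind k_diffusion.sampling.sample_dpmpp_2m, not a drop-in replacement; `denoise` stands in for the wrapped model:

```python
import math

def sample_dpmpp_2m(denoise, x, sigmas):
    # DPM-Solver++(2M): second-order multistep ODE solver in t = -log(sigma) time.
    # `denoise(x, sigma)` should return the model's estimate of the clean sample.
    t_fn = lambda sigma: -math.log(sigma)
    sigma_fn = lambda t: math.exp(-t)
    old_denoised = None
    for i in range(len(sigmas) - 1):
        denoised = denoise(x, sigmas[i])
        if sigmas[i + 1] == 0:
            x = denoised  # final step: jump to the model's clean estimate
            break
        t, t_next = t_fn(sigmas[i]), t_fn(sigmas[i + 1])
        h = t_next - t
        if old_denoised is None:
            # First step: no history yet, fall back to a first-order update.
            x = (sigma_fn(t_next) / sigma_fn(t)) * x - math.expm1(-h) * denoised
        else:
            # Extrapolate from the previous denoised estimate (the multistep part).
            h_last = t - t_fn(sigmas[i - 1])
            r = h_last / h
            d = (1 + 1 / (2 * r)) * denoised - (1 / (2 * r)) * old_denoised
            x = (sigma_fn(t_next) / sigma_fn(t)) * x - math.expm1(-h) * d
        old_denoised = denoised
    return x
```

Each step blends the current and previous denoised estimates, which is what makes the method second-order while still calling the model only once per step.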

Maintenance & Community

  • No specific contributors, sponsorships, or community links (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

The shifted-window attention variant of image_transformer_v2 is slower and produces worse samples than the NATTEN-based version. Models trained with one attention type must be fine-tuned before they can be used with a different type. The README's inference section is marked "TODO".

Health Check
Last Commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
15 stars in the last 30 days
