unet.cu  by clu0

CUDA-based UNet diffusion model training implementation

created 1 year ago
612 stars

Top 54.5% on sourcepulse

Project Summary

This project implements a UNet diffusion model entirely in C++/CUDA, targeting researchers and engineers interested in low-level CUDA optimization for deep learning. It aims to match or exceed PyTorch performance for diffusion model training by hand-optimizing core operations such as 3x3 convolutions.

How It Works

The project builds a UNet architecture, a standard component in diffusion models, using custom CUDA kernels for performance. It iteratively refines implementations of key layers, starting with adaptations from llm.c and progressing to highly optimized custom kernels for convolutions. The approach focuses on minimizing memory transfers, maximizing GPU utilization through techniques like vectorized loads and shared memory, and avoiding expensive operations like global memory transpositions.
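
The shared-memory idea mentioned above can be illustrated on the CPU. The following is a minimal sketch in plain C++, not code from the repository: the function name `conv3x3_tiled`, the single-channel layout, and the default tile size are assumptions made for illustration. Each tile of the input, plus a one-pixel halo, is copied into a small buffer once, and every output in the tile is then computed from that buffer. In a CUDA kernel the buffer would typically live in `__shared__` memory so that neighbouring threads reuse the same loaded pixels.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Single-channel 3x3 convolution (cross-correlation, as in deep learning
// frameworks) with zero padding of 1, computed tile by tile.
// Hypothetical helper for illustration; not taken from the unet.cu sources.
std::vector<float> conv3x3_tiled(const std::vector<float>& in, int H, int W,
                                 const std::array<float, 9>& k, int tile = 4) {
    assert(H > 0 && W > 0 && tile > 0);
    assert(static_cast<int>(in.size()) == H * W);
    std::vector<float> out(in.size(), 0.0f);
    const int halo = tile + 2;            // tile plus a one-pixel border per side
    std::vector<float> buf(halo * halo);  // stands in for __shared__ memory

    for (int ty = 0; ty < H; ty += tile) {
        for (int tx = 0; tx < W; tx += tile) {
            // Stage: read each needed input pixel from "global" memory once.
            for (int y = 0; y < halo; ++y) {
                for (int x = 0; x < halo; ++x) {
                    int gy = ty + y - 1, gx = tx + x - 1;
                    bool inside = gy >= 0 && gy < H && gx >= 0 && gx < W;
                    buf[y * halo + x] = inside ? in[gy * W + gx] : 0.0f;
                }
            }
            // Compute: every output in the tile reuses the staged pixels.
            int yEnd = std::min(tile, H - ty), xEnd = std::min(tile, W - tx);
            for (int y = 0; y < yEnd; ++y) {
                for (int x = 0; x < xEnd; ++x) {
                    float acc = 0.0f;
                    for (int ky = 0; ky < 3; ++ky)
                        for (int kx = 0; kx < 3; ++kx)
                            acc += k[ky * 3 + kx] * buf[(y + ky) * halo + (x + kx)];
                    out[(ty + y) * W + (tx + x)] = acc;
                }
            }
        }
    }
    return out;
}
```

The result does not depend on the tile size; only the number of times each input pixel is read changes, which is the point of staging data before computing on it.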

Quick Start & Requirements

  • Install/Run:
    1. gunzip data/elephant_train.bin.gz
    2. python train_unet.py --init_model_only True
    3. make train_unet
    4. ./train_unet
  • Prerequisites: CUDA, a C++ compiler, and Python (step 2 runs train_unet.py).
  • Data: Requires ImageNet 64x64 data, or a custom .bin file that you prepare yourself.

Highlighted Details

  • Custom CUDA kernels for 3x3 convolutions achieve significant speedups over naive implementations, reducing forward pass time from 15.1 ms to 1.3 ms (roughly 11.6x).
  • The project benchmarks and profiles each optimization step, detailing bottlenecks like memory access patterns and warp stalls.
  • It aims to replicate PyTorch performance, with the latest version achieving ~70% of PyTorch's speed for the forward pass and ~30% for the backward pass.
  • Supports unconditional diffusion training with plans for future extensions.
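
When comparing optimization steps like those above, it helps to check that a faster kernel still agrees with a simple baseline. The following is a minimal CPU sketch of such a baseline, not code from the repository: the names `conv3x3_reference` and `max_abs_diff` are hypothetical, and the memory layouts (input `[C_in][H][W]`, weights `[C_out][C_in][3][3]`) with stride 1 and zero padding 1 are assumptions, since the summary does not state them.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive 3x3 convolution, stride 1, zero padding 1, for one image.
// Assumed layouts (illustrative only): input [C_in][H][W],
// weights [C_out][C_in][3][3], output [C_out][H][W].
std::vector<float> conv3x3_reference(const std::vector<float>& in,
                                     const std::vector<float>& w,
                                     int C_in, int C_out, int H, int W) {
    assert(in.size() == static_cast<std::size_t>(C_in) * H * W);
    assert(w.size() == static_cast<std::size_t>(C_out) * C_in * 9);
    std::vector<float> out(static_cast<std::size_t>(C_out) * H * W, 0.0f);
    for (int co = 0; co < C_out; ++co) {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int ci = 0; ci < C_in; ++ci) {
                    for (int ky = 0; ky < 3; ++ky) {
                        for (int kx = 0; kx < 3; ++kx) {
                            int iy = y + ky - 1, ix = x + kx - 1;
                            // Out-of-range taps read the zero padding.
                            if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                            acc += w[((co * C_in + ci) * 3 + ky) * 3 + kx] *
                                   in[(ci * H + iy) * W + ix];
                        }
                    }
                }
                out[(co * H + y) * W + x] = acc;
            }
        }
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```

A typical use would be to copy a GPU kernel's output back to the host and require `max_abs_diff` against this reference to stay below a small tolerance before trusting any timing numbers.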

Maintenance & Community

  • Primarily a personal project by clu0, inspired by llm.c and CUDA optimization blogs. No explicit community channels or roadmap are mentioned.

Licensing & Compatibility

  • The repository does not explicitly state a license.

Limitations & Caveats

  • Currently supports only unconditional diffusion.
  • Backward pass performance significantly lags behind PyTorch.
  • The project focuses on learning and optimization rather than production deployment.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days
