CUDA-based UNet diffusion model training implementation
This project implements a UNet diffusion model entirely in C++/CUDA, targeting researchers and engineers interested in low-level CUDA optimization for deep learning. It aims to match or exceed PyTorch performance for diffusion model training by custom-optimizing core operations like 3x3 convolutions.
How It Works
The project builds a UNet architecture, the standard backbone in diffusion models, from custom CUDA kernels written for performance. Key layers are implemented iteratively, starting with adaptations from llm.c and progressing to highly optimized custom convolution kernels. The approach focuses on minimizing memory transfers and maximizing GPU utilization through techniques like vectorized loads and shared memory, while avoiding expensive operations such as global-memory transpositions.
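The central operation being optimized is the 3x3 same-padding convolution. As a reference for what the custom kernels compute, here is a minimal CPU sketch (single image, single channel, hypothetical shapes — not the repository's actual API); the CUDA versions produce the same result but tile the input into shared memory and use vectorized loads to reduce global-memory traffic:

```cpp
#include <vector>
#include <cassert>

// Reference 3x3 convolution with zero padding ("same" output size).
// Input and output are H x W, row-major; w holds the 9 kernel weights.
// The repository's CUDA kernels compute this per channel pair, but load
// input tiles into shared memory and fetch them with vectorized loads.
std::vector<float> conv3x3(const std::vector<float>& in,
                           const std::vector<float>& w,
                           int H, int W) {
    std::vector<float> out(H * W, 0.0f);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    int iy = y + ky, ix = x + kx;
                    // Out-of-bounds taps contribute zero (zero padding).
                    if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                    acc += in[iy * W + ix] * w[(ky + 1) * 3 + (kx + 1)];
                }
            }
            out[y * W + x] = acc;
        }
    }
    return out;
}
```

On the GPU, each thread block processes one output tile, so the (tile + 1-pixel halo) input region is loaded into shared memory once and reused by all nine taps, which is where the bulk of the speedup over a naive per-thread global-memory loop comes from.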
Quick Start & Requirements
gunzip data/elephant_train.bin.gz
python train_unet.py --init_model_only True
make train_unet
./train_unet
Setup requires .bin training data preparation; the gzipped data file is decompressed in the first step above.
Highlighted Details
Maintenance & Community
Maintained by clu0, inspired by llm.c and CUDA optimization blogs. No explicit community channels or roadmap are mentioned.
Licensing & Compatibility
Limitations & Caveats
The repository's last activity was about a year ago, and the project appears inactive.