unet.cu by clu0

CUDA-based UNet diffusion model training implementation

Created 1 year ago

662 stars

Top 50.9% on SourcePulse

1 Expert Loves This Project

Andrej Karpathy

Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n

Project Summary

This project implements a UNet diffusion model entirely in C++/CUDA, targeting researchers and engineers interested in low-level CUDA optimization for deep learning. It aims to match or exceed PyTorch performance for diffusion model training by hand-optimizing core operations such as 3x3 convolutions.

How It Works

The project builds a UNet architecture, a standard component in diffusion models, using custom CUDA kernels for performance. It iteratively refines implementations of key layers, starting with adaptations from llm.c and progressing to highly optimized custom kernels for convolutions. The approach focuses on minimizing memory transfers, maximizing GPU utilization through techniques like vectorized loads and shared memory, and avoiding expensive operations like global memory transpositions.

Quick Start & Requirements

Install/Run:
1. gunzip data/elephant_train.bin.gz
2. python train_unet.py --init_model_only True
3. make train_unet
4. ./train_unet
Prerequisites: CUDA toolkit, C++ compiler, make, and Python (for the model-initialization script in step 2).
Data: Requires ImageNet 64x64 data or custom .bin file preparation.

Highlighted Details

Custom CUDA kernels for 3x3 convolutions are substantially faster than naive implementations, reducing forward pass time from 15.1 ms to 1.3 ms (about 11.6x).
The project benchmarks and profiles each optimization step, detailing bottlenecks like memory access patterns and warp stalls.
It aims to replicate PyTorch performance, with the latest version achieving ~70% of PyTorch's speed for the forward pass and ~30% for the backward pass.
Supports unconditional diffusion training with plans for future extensions.
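
For context on what a "naive implementation" of a 3x3 convolution looks like, the following is a minimal CPU reference sketch in C++. It is not code from the repository: the function name conv3x3_reference, the channel-major memory layouts, and the stride-1, zero-padding-1 setup are all assumptions made for illustration. A reference like this is typically useful as a correctness check when developing optimized kernels, since every output element is computed by a plain loop over input channels and the 3x3 window.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive direct 3x3 convolution, stride 1, zero padding 1, so the output has
// the same height and width as the input.
// Assumed layouts (illustrative, not taken from the repository):
//   input  [c_in][h][w]
//   weight [c_out][c_in][3][3]
//   bias   [c_out]
//   output [c_out][h][w]
std::vector<float> conv3x3_reference(const std::vector<float>& input,
                                     const std::vector<float>& weight,
                                     const std::vector<float>& bias,
                                     int c_in, int c_out, int h, int w) {
    assert(input.size() == static_cast<std::size_t>(c_in) * h * w);
    assert(weight.size() == static_cast<std::size_t>(c_out) * c_in * 9);
    assert(bias.size() == static_cast<std::size_t>(c_out));

    std::vector<float> output(static_cast<std::size_t>(c_out) * h * w, 0.0f);
    for (int co = 0; co < c_out; ++co) {
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                float acc = bias[co];
                for (int ci = 0; ci < c_in; ++ci) {
                    for (int ky = 0; ky < 3; ++ky) {
                        for (int kx = 0; kx < 3; ++kx) {
                            int iy = y + ky - 1;  // input row under this tap
                            int ix = x + kx - 1;  // input column under this tap
                            // Taps that fall outside the image read the zero padding.
                            if (iy < 0 || iy >= h || ix < 0 || ix >= w) continue;
                            acc += input[(ci * h + iy) * w + ix] *
                                   weight[((co * c_in + ci) * 3 + ky) * 3 + kx];
                        }
                    }
                }
                output[(co * h + y) * w + x] = acc;
            }
        }
    }
    return output;
}
```

Each output element here re-reads its 3x3 neighborhood for every input channel, which is the kind of redundant memory traffic that techniques such as shared memory tiling and vectorized loads are meant to reduce on a GPU.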

Maintenance & Community

Primarily a personal project by clu0, inspired by llm.c and CUDA optimization blogs. No explicit community channels or roadmap are mentioned.

Licensing & Compatibility

The repository does not explicitly state a license.

Limitations & Caveats

Currently supports only unconditional diffusion.
Backward pass performance lags well behind PyTorch, at roughly 30% of its speed.
The project targets learning and optimization rather than production deployment.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Star History: 6 stars in the last 30 days