LeetCUDA by xlite-dev

CUDA learning notes for beginners using PyTorch

Created 2 years ago
6,961 stars

Top 7.4% on SourcePulse

1 Expert Loves This Project
Project Summary

This repository provides a comprehensive collection of modern CUDA learning notes and optimized kernels for PyTorch users, particularly beginners. It aims to demystify CUDA programming and high-performance computing concepts by offering over 200 CUDA kernels, detailed blogs on LLM optimization, and implementations of advanced techniques like HGEMM and FlashAttention-MMA.

How It Works

LeetCUDA leverages PyTorch for Python bindings and showcases custom CUDA kernel implementations. It focuses on optimizing matrix multiplication (HGEMM) and attention mechanisms (FlashAttention-MMA) by utilizing Tensor Cores (WMMA, MMA, CuTe) and advanced tiling, memory access patterns, and multi-stage processing. This approach aims to achieve near-peak hardware performance and provide a deep understanding of GPU architecture.
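
The Tensor Core path can be illustrated with a minimal WMMA tile loop. The sketch below is illustrative only (kernel name, launch shape, and layouts are assumptions, not the repository's code): one warp accumulates a single 16x16 output tile of a half-precision GEMM, which is the core loop that the HGEMM kernels build on with shared-memory tiling, swizzled layouts, and multi-stage pipelines.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (one 32-thread block) computes one 16x16 tile of C = A * B in half
// precision, accumulating in fp32. A (MxK) is row-major, B (KxN) is column-major.
// Names and launch shape are illustrative, not taken from the repository.
__global__ void hgemm_wmma_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
  int tile_m = blockIdx.y;  // which 16-row block of C
  int tile_n = blockIdx.x;  // which 16-column block of C

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);

  // March along K in 16-wide steps; each mma_sync is one Tensor Core 16x16x16 MMA.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + tile_n * 16 * K + k, K);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }
  wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                          wmma::mem_row_major);
}
// Launch with grid (N/16, M/16) and 32 threads (one warp) per block.
```

The optimized HGEMM kernels described in the README layer shared-memory staging, better global-memory access patterns, and multi-stage processing on top of a loop like this to approach cuBLAS throughput.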

Quick Start & Requirements

  • Installation: Primarily involves cloning the repository and, where needed, building the custom CUDA kernels with PyTorch bindings; specific build commands are not detailed in the README (a minimal binding sketch follows this list).
  • Prerequisites: NVIDIA GPU with CUDA toolkit, PyTorch. Specific CUDA versions or hardware capabilities (e.g., Tensor Cores) are beneficial for running optimized kernels.
  • Resources: Building and running custom CUDA kernels can be resource-intensive, requiring a development environment with a compatible NVIDIA GPU.
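
As a concrete illustration of the PyTorch-binding pattern (file and function names here are hypothetical, not the repository's actual layout), a kernel is typically exposed to Python through a small C++/CUDA extension:

```cuda
// bind.cu -- hypothetical example of the standard PyTorch C++/CUDA extension pattern.
#include <torch/extension.h>

// CUDA kernel: one thread per element (illustrative).
__global__ void add_kernel(const float* x, const float* y, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = x[i] + y[i];
}

// Launcher callable from Python once the extension is built.
torch::Tensor elementwise_add(torch::Tensor x, torch::Tensor y) {
  TORCH_CHECK(x.is_cuda() && y.is_cuda(), "inputs must be CUDA tensors");
  auto out = torch::empty_like(x);
  int n = x.numel();
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  add_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(),
                                  out.data_ptr<float>(), n);
  return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add", &elementwise_add, "element-wise add (CUDA)");
}
```

A file like this can be built just-in-time with torch.utils.cpp_extension.load or ahead of time via a setup.py using CUDAExtension; either way, an nvcc toolchain matching the installed PyTorch CUDA version is required.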

Highlighted Details

  • Over 200 CUDA kernels covering a wide range of complexities, from basic element-wise operations to advanced HGEMM and FlashAttention implementations (a small example of this progression is sketched after this list).
  • HGEMM kernels achieve 98%-100% of cuBLAS performance on various NVIDIA GPUs.
  • FlashAttention-MMA implementations offer significant speedups over standard implementations for certain configurations, particularly for smaller sequence lengths or specific hardware.
  • Extensive blog content explaining LLM inference optimization, CUDA programming, and GPU architecture.
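
To give a feel for that progression (the kernel below is a hedged illustration, not copied from the repository), the naive element-wise add shown earlier can be rewritten with vectorized half2 loads and stores, one of the memory-access optimizations the intermediate-level kernels introduce before moving on to Tensor Core work:

```cuda
#include <cuda_fp16.h>

// Element-wise add over fp16 data, two elements per thread via half2.
// Assumes n is even and pointers are 4-byte aligned; illustrative only.
__global__ void elementwise_add_half2(const half* x, const half* y,
                                      half* out, int n) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
  if (i < n) {
    // Reinterpret as half2 so each load/store moves 4 bytes instead of 2.
    half2 a = *reinterpret_cast<const half2*>(x + i);
    half2 b = *reinterpret_cast<const half2*>(y + i);
    *reinterpret_cast<half2*>(out + i) = __hadd2(a, b);
  }
}
```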

Maintenance & Community

The project is actively maintained by xlite-dev and welcomes contributions. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The repository is licensed under the GNU General Public License v3.0 (GPL-3.0). GPL-3.0 is a copyleft license: derivative works that are distributed must be released under the same license, which may restrict commercial or closed-source integration.

Limitations & Caveats

The README indicates that some FlashAttention-MMA implementations may have performance gaps for large-scale attention compared to established libraries, and some features are marked as deprecated or under refactoring. The setup process for building custom kernels might require significant CUDA development expertise.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 6
  • Star History: 1,021 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 15 more.

ThunderKittens by HazyResearch

0.6%
3k
CUDA kernel framework for fast deep learning primitives
Created 1 year ago
Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

0.6%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8%
5k
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
20k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy

0.2%
28k
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago
Updated 2 months ago