LeetCUDA by xlite-dev

CUDA learning notes for beginners using PyTorch

created 2 years ago
5,793 stars

Top 9.1% on sourcepulse

Project Summary

This repository provides a comprehensive collection of modern CUDA learning notes and optimized kernels for PyTorch users, particularly beginners. It aims to demystify CUDA programming and high-performance computing concepts by offering over 200 CUDA kernels, detailed blogs on LLM optimization, and implementations of advanced techniques like HGEMM and FlashAttention-MMA.

How It Works

LeetCUDA leverages PyTorch for Python bindings and showcases custom CUDA kernel implementations. It focuses on optimizing matrix multiplication (HGEMM) and attention mechanisms (FlashAttention-MMA) by utilizing Tensor Cores (WMMA, MMA, CuTe) and advanced tiling, memory access patterns, and multi-stage processing. This approach aims to achieve near-peak hardware performance and provide a deep understanding of GPU architecture.
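
For intuition, the Tensor Core building block looks roughly like the minimal WMMA sketch below. This is a hypothetical, illustrative kernel, not code from the repository: one warp computes a single 16x16 output tile of a half-precision GEMM, while the repo's HGEMM kernels layer shared-memory tiling, swizzled memory access, and multi-stage pipelines on top of this primitive.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Hypothetical sketch: one warp computes one 16x16 tile of C = A * B
    // (half inputs, float accumulate) on Tensor Cores via WMMA.
    // Launch with grid(N/16, M/16) and 32 threads (one warp) per block;
    // assumes M, N, K are multiples of 16 and row-major storage.
    __global__ void wmma_hgemm_naive(const half* A, const half* B, float* C,
                                     int M, int N, int K) {
      int tile_m = blockIdx.y;  // which 16-row band of C this warp owns
      int tile_n = blockIdx.x;  // which 16-column band of C this warp owns

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
      wmma::fill_fragment(c_frag, 0.0f);

      // March along K, issuing one 16x16x16 Tensor Core MMA per step.
      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
      }
      wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                              wmma::mem_row_major);
    }

The gap between a naive loop like this and cuBLAS-level throughput is what the repo's tiling, vectorized memory-access, and multi-stage (double-buffered) variants address.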

Quick Start & Requirements

  • Installation: Primarily involves cloning the repository and potentially building custom CUDA kernels with PyTorch bindings. Specific build commands are not detailed in the README; a typical binding pattern is sketched after this list.
  • Prerequisites: NVIDIA GPU with CUDA toolkit, PyTorch. Specific CUDA versions or hardware capabilities (e.g., Tensor Cores) are beneficial for running optimized kernels.
  • Resources: Building and running custom CUDA kernels can be resource-intensive, requiring a development environment with a compatible NVIDIA GPU.
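
As a concrete illustration of the binding pattern, the sketch below shows a typical PyTorch C++/CUDA extension. It is a hypothetical example (the file name elementwise_add.cu and function add are invented here), not the repo's actual build setup:

    // elementwise_add.cu (hypothetical example, not from the repo)
    #include <torch/extension.h>

    __global__ void add_kernel(const float* a, const float* b, float* out,
                               int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = a[i] + b[i];
    }

    torch::Tensor add(torch::Tensor a, torch::Tensor b) {
      TORCH_CHECK(a.is_cuda() && b.is_cuda(), "expected CUDA tensors");
      TORCH_CHECK(a.is_contiguous() && b.is_contiguous(),
                  "expected contiguous tensors");
      auto out = torch::empty_like(a);
      int n = static_cast<int>(a.numel());
      int threads = 256;
      int blocks = (n + threads - 1) / threads;
      add_kernel<<<blocks, threads>>>(a.data_ptr<float>(), b.data_ptr<float>(),
                                      out.data_ptr<float>(), n);
      return out;
    }

    // Exposes the kernel to Python as a torch extension module.
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("add", &add, "elementwise add (CUDA)");
    }

A file like this can be JIT-compiled and loaded from Python with torch.utils.cpp_extension.load(name="elementwise_add", sources=["elementwise_add.cu"]), which handles the nvcc invocation and module binding automatically.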

Highlighted Details

  • Over 200 CUDA kernels covering a wide range of complexities, from basic element-wise operations to advanced HGEMM and FlashAttention implementations.
  • HGEMM kernels achieve 98%-100% of cuBLAS performance on various NVIDIA GPUs.
  • FlashAttention-MMA implementations offer significant speedups over standard implementations for certain configurations, particularly for smaller sequence lengths or specific hardware.
  • Extensive blog content explaining LLM inference optimization, CUDA programming, and GPU architecture.

Maintenance & Community

The project is actively maintained by xlite-dev and welcomes contributions. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The repository is licensed under the GNU General Public License v3.0 (GPL-3.0). This license is copyleft: derivative works that are distributed must be released under the same terms, which may restrict commercial or closed-source integration.

Limitations & Caveats

The README indicates that some FlashAttention-MMA implementations may have performance gaps for large-scale attention compared to established libraries, and some features are marked as deprecated or under refactoring. The setup process for building custom kernels might require significant CUDA development expertise.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 12
  • Star History: 1,975 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 7 more.

ThunderKittens by HazyResearch

  • CUDA kernel framework for fast deep learning primitives
  • 3k stars, Top 0.6%
  • created 1 year ago, updated 3 days ago