LeetCUDA by xlite-dev

CUDA learning notes for beginners using PyTorch

created 2 years ago
5,793 stars

Top 9.1% on sourcepulse

Project Summary

This repository provides a comprehensive collection of modern CUDA learning notes and optimized kernels for PyTorch users, particularly beginners. It aims to demystify CUDA programming and high-performance computing concepts by offering over 200 CUDA kernels, detailed blogs on LLM optimization, and implementations of advanced techniques like HGEMM and FlashAttention-MMA.

How It Works

LeetCUDA leverages PyTorch for Python bindings and showcases custom CUDA kernel implementations. It focuses on optimizing matrix multiplication (HGEMM) and attention mechanisms (FlashAttention-MMA) by utilizing Tensor Cores (WMMA, MMA, CuTe) and advanced tiling, memory access patterns, and multi-stage processing. This approach aims to achieve near-peak hardware performance and provide a deep understanding of GPU architecture.
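
For intuition, the Tensor Core building block looks roughly like the minimal WMMA sketch below. This is a hypothetical, illustrative kernel, not code from the repository: one warp computes a single 16x16 output tile of a half-precision GEMM, while the repo's HGEMM kernels layer shared-memory tiling, swizzled memory access, and multi-stage pipelines on top of this primitive.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Hypothetical sketch: one warp computes one 16x16 tile of C = A * B
    // (half inputs, float accumulate) on Tensor Cores via WMMA.
    // Launch with grid(N/16, M/16) and 32 threads (one warp) per block;
    // assumes M, N, K are multiples of 16 and row-major storage.
    __global__ void wmma_hgemm_naive(const half* A, const half* B, float* C,
                                     int M, int N, int K) {
      int tile_m = blockIdx.y;  // which 16-row band of C this warp owns
      int tile_n = blockIdx.x;  // which 16-column band of C this warp owns

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
      wmma::fill_fragment(c_frag, 0.0f);

      // March along K, issuing one 16x16x16 Tensor Core MMA per step.
      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
      }
      wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                              wmma::mem_row_major);
    }

The gap between a naive loop like this and cuBLAS-level throughput is what the repo's tiling, vectorized memory-access, and multi-stage (double-buffered) variants address.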

Quick Start & Requirements

  • Installation: Primarily involves cloning the repository and potentially building custom CUDA kernels with PyTorch bindings. Specific build commands are not detailed in the README; a typical binding pattern is sketched after this list.
  • Prerequisites: NVIDIA GPU with CUDA toolkit, PyTorch. Specific CUDA versions or hardware capabilities (e.g., Tensor Cores) are beneficial for running optimized kernels.
  • Resources: Building and running custom CUDA kernels can be resource-intensive, requiring a development environment with a compatible NVIDIA GPU.
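
As a concrete illustration of the binding pattern, the sketch below shows a typical PyTorch C++/CUDA extension. It is a hypothetical example (the file name elementwise_add.cu and function add are invented here), not the repo's actual build setup:

    // elementwise_add.cu (hypothetical example, not from the repo)
    #include <torch/extension.h>

    __global__ void add_kernel(const float* a, const float* b, float* out,
                               int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = a[i] + b[i];
    }

    torch::Tensor add(torch::Tensor a, torch::Tensor b) {
      TORCH_CHECK(a.is_cuda() && b.is_cuda(), "expected CUDA tensors");
      TORCH_CHECK(a.is_contiguous() && b.is_contiguous(),
                  "expected contiguous tensors");
      auto out = torch::empty_like(a);
      int n = static_cast<int>(a.numel());
      int threads = 256;
      int blocks = (n + threads - 1) / threads;
      add_kernel<<<blocks, threads>>>(a.data_ptr<float>(), b.data_ptr<float>(),
                                      out.data_ptr<float>(), n);
      return out;
    }

    // Exposes the kernel to Python as a torch extension module.
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("add", &add, "elementwise add (CUDA)");
    }

A file like this can be JIT-compiled and loaded from Python with torch.utils.cpp_extension.load(name="elementwise_add", sources=["elementwise_add.cu"]), which handles the nvcc invocation and module binding automatically.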

Highlighted Details

  • Over 200 CUDA kernels covering a wide range of complexities, from basic element-wise operations to advanced HGEMM and FlashAttention implementations.
  • HGEMM kernels achieve 98%-100% of cuBLAS performance on various NVIDIA GPUs.
  • FlashAttention-MMA implementations offer significant speedups over standard implementations for certain configurations, particularly for smaller sequence lengths or specific hardware.
  • Extensive blog content explaining LLM inference optimization, CUDA programming, and GPU architecture.

Maintenance & Community

The project is actively maintained by xlite-dev and welcomes contributions. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The repository is licensed under the GNU General Public License v3.0 (GPL-3.0). This license is copyleft: derivative works that are distributed must be released under the same terms, which may restrict commercial or closed-source integration.

Limitations & Caveats

The README indicates that some FlashAttention-MMA implementations may have performance gaps for large-scale attention compared to established libraries, and some features are marked as deprecated or under refactoring. The setup process for building custom kernels might require significant CUDA development expertise.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 12
  • Star History: 1,975 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 7 more.

ThunderKittens by HazyResearch

  • CUDA kernel framework for fast deep learning primitives
  • 3k stars, Top 0.6%
  • created 1 year ago, updated 3 days ago