ThunderKittens by HazyResearch

CUDA kernel framework for fast deep learning primitives

Created 1 year ago · 2,862 stars

Top 16.6% on SourcePulse

Project Summary

ThunderKittens provides a C++ framework for writing high-performance deep learning kernels in CUDA, targeting developers who need to optimize low-level GPU operations. It simplifies the creation of efficient kernels by abstracting complex hardware features like tensor cores and shared memory, enabling performance comparable to hand-written kernels.

How It Works

ThunderKittens is built around the principle of operating on small, fixed-size "tiles" of data, typically 16x16, matching the granularity at which modern GPU tensor-core hardware operates. It exposes primitives for managing data in registers and shared memory, with explicit support for layouts and data types. The framework also supports asynchronous operations and overlapping work across workers, and gives direct access to hardware features such as Tensor Cores (via WGMMA) and the Tensor Memory Accelerator (TMA) for optimized loads and stores, all aimed at maximizing GPU utilization.
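
To make the tile-centric model concrete, here is a minimal sketch in plain CUDA using the standard nvcuda::wmma API rather than ThunderKittens' own tile types (illustrative only; the framework's actual primitives live in kittens.cuh). One warp computes a single 16x16 output tile on the tensor cores, the unit of work the framework's register and shared-memory tiles are organized around.

```cuda
// Illustration in plain CUDA (not ThunderKittens' API): one warp computes a
// 16x16 tile C = A * B on the tensor cores via the standard WMMA intrinsics.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_mma_16x16(const half* A, const half* B, float* C) {
    // Fragments live in registers spread across the warp's 32 threads --
    // the same "register tile" idea ThunderKittens generalizes.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // tensor-core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single 32-thread block (one warp, sm_70 or newer), this computes one 16x16 half-precision matrix product; ThunderKittens layers typed tile objects, layout management, and asynchronous loads (e.g. TMA) on top of the same hardware path.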

Quick Start & Requirements

  • Install: Clone the repository and include kittens.cuh. For PyTorch bindings, cd kernels/example_bind and run python setup.py install after setting environment variables via source env.src.
  • Prerequisites: CUDA 12.3+, C++20 (GCC 11+ or Clang 11+ recommended).
  • Resources: Building tests can take a few minutes.
  • Docs: ThunderKittens Manual (in-progress).

Highlighted Details

  • Achieves ~155 TFLOPs (93% of theoretical max) on an RTX 4090 with a FlashAttention-2 kernel.
  • Supports FP8, asynchronous WGMMA calls, TMA loads/stores, and shared memory bank conflict avoidance.
  • Emphasizes compile-time layout and type checking for safety (a generic sketch of the idea follows this list).
  • Designed for extensibility, allowing integration with custom C++ code.
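
As a rough illustration of what compile-time shape checking buys, here is a generic C++ sketch with hypothetical types, not ThunderKittens' actual tile classes: when tile dimensions are template parameters, a shape mismatch is rejected by the compiler instead of surfacing as a runtime bug.

```cpp
// Hypothetical fixed-size tile type used only to illustrate compile-time
// shape checking; not ThunderKittens' real tile types.
template <typename T, int Rows, int Cols>
struct Tile {
    T data[Rows][Cols] = {};
};

// Only well-formed when the inner dimensions agree (M x K times K x N).
template <typename T, int M, int K, int N>
Tile<T, M, N> matmul(const Tile<T, M, K>& a, const Tile<T, K, N>& b) {
    Tile<T, M, N> c;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c.data[i][j] += a.data[i][k] * b.data[k][j];
    return c;
}

int main() {
    Tile<float, 16, 16> a, b;
    Tile<float, 16, 16> c = matmul(a, b);  // OK: 16x16 * 16x16 -> 16x16
    // Tile<float, 32, 16> bad;
    // matmul(a, bad);  // compile error: inner dimensions 16 vs 32 disagree
    (void)c;
    return 0;
}
```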

Maintenance & Community

  • Active development with recent updates (Nov 2024) including FP8 support and improved PyTorch bindings.
  • Community engagement via the GPU Mode Discord.
  • Blog posts and a research paper are available for deeper understanding.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Primarily focused on CUDA; support for MPS and ROCm is planned but not yet implemented.
  • The manual is still in progress, and some "sharp edges" may require careful reading of the source code.
  • Requires a modern C++ compiler and CUDA toolkit, potentially necessitating environment setup.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 5
  • Star History: 99 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin
Triton kernels for efficient LLM training
0.4% · 6k stars · Created 1 year ago · Updated 5 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode
Lecture series for GPU-accelerated computing
0.5% · 5k stars · Created 1 year ago · Updated 1 month ago
Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
0.3% · 6k stars · Created 8 months ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab
Fast, memory-efficient attention implementation
0.6% · 20k stars · Created 3 years ago · Updated 19 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy
LLM training in pure C/CUDA, no PyTorch needed
0.2% · 28k stars · Created 1 year ago · Updated 4 months ago