ThunderKittens by HazyResearch

CUDA kernel framework for fast deep learning primitives

Created 1 year ago · 2,862 stars

Top 16.6% on SourcePulse

Project Summary

ThunderKittens provides a C++ framework for writing high-performance deep learning kernels in CUDA, targeting developers who need to optimize low-level GPU operations. It simplifies the creation of efficient kernels by abstracting complex hardware features like tensor cores and shared memory, enabling performance comparable to hand-written kernels.

How It Works

ThunderKittens is built around the principle of operating on small, fixed-size "tiles" of data, typically 16x16, matching the granularity at which modern GPU tensor-core hardware operates. It exposes primitives for managing data in registers and shared memory, with explicit support for layouts and data types. The framework also supports asynchronous operations and overlapping work across workers, and gives direct access to hardware features such as Tensor Cores (via WGMMA) and the Tensor Memory Accelerator (TMA) for optimized loads and stores, all aimed at maximizing GPU utilization.
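
To make the tile-centric model concrete, here is a minimal sketch in plain CUDA using the standard nvcuda::wmma API rather than ThunderKittens' own tile types (illustrative only; the framework's actual primitives live in kittens.cuh). One warp computes a single 16x16 output tile on the tensor cores, the unit of work the framework's register and shared-memory tiles are organized around.

```cuda
// Illustration in plain CUDA (not ThunderKittens' API): one warp computes a
// 16x16 tile C = A * B on the tensor cores via the standard WMMA intrinsics.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_mma_16x16(const half* A, const half* B, float* C) {
    // Fragments live in registers spread across the warp's 32 threads --
    // the same "register tile" idea ThunderKittens generalizes.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // tensor-core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single 32-thread block (one warp, sm_70 or newer), this computes one 16x16 half-precision matrix product; ThunderKittens layers typed tile objects, layout management, and asynchronous loads (e.g. TMA) on top of the same hardware path.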

Quick Start & Requirements

  • Install: Clone the repository and include kittens.cuh. For PyTorch bindings, cd kernels/example_bind and run python setup.py install after setting environment variables via source env.src.
  • Prerequisites: CUDA 12.3+, C++20 (GCC 11+ or Clang 11+ recommended).
  • Resources: Building tests can take a few minutes.
  • Docs: ThunderKittens Manual (in-progress).

Highlighted Details

  • Achieves ~155 TFLOPs (93% of theoretical max) on an RTX 4090 with a FlashAttention-2 kernel.
  • Supports FP8, asynchronous WGMMA calls, TMA loads/stores, and shared memory bank conflict avoidance.
  • Emphasizes compile-time layout and type checking for safety (a generic sketch of the idea follows this list).
  • Designed for extensibility, allowing integration with custom C++ code.
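
As a rough illustration of what compile-time shape checking buys, here is a generic C++ sketch with hypothetical types, not ThunderKittens' actual tile classes: when tile dimensions are template parameters, a shape mismatch is rejected by the compiler instead of surfacing as a runtime bug.

```cpp
// Hypothetical fixed-size tile type used only to illustrate compile-time
// shape checking; not ThunderKittens' real tile types.
template <typename T, int Rows, int Cols>
struct Tile {
    T data[Rows][Cols] = {};
};

// Only well-formed when the inner dimensions agree (M x K times K x N).
template <typename T, int M, int K, int N>
Tile<T, M, N> matmul(const Tile<T, M, K>& a, const Tile<T, K, N>& b) {
    Tile<T, M, N> c;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c.data[i][j] += a.data[i][k] * b.data[k][j];
    return c;
}

int main() {
    Tile<float, 16, 16> a, b;
    Tile<float, 16, 16> c = matmul(a, b);  // OK: 16x16 * 16x16 -> 16x16
    // Tile<float, 32, 16> bad;
    // matmul(a, bad);  // compile error: inner dimensions 16 vs 32 disagree
    (void)c;
    return 0;
}
```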

Maintenance & Community

  • Active development with recent updates (Nov 2024) including FP8 support and improved PyTorch bindings.
  • Community engagement via the GPU Mode Discord.
  • Blog posts and a research paper are available for deeper understanding.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Primarily focused on CUDA; support for MPS and ROCm is planned but not yet implemented.
  • The manual is still in progress, and some "sharp edges" may require careful reading of the source code.
  • Requires a modern C++ compiler and CUDA toolkit, potentially necessitating environment setup.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 5
  • Star History: 99 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin
Triton kernels for efficient LLM training
0.4% · 6k stars · Created 1 year ago · Updated 5 hours ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode
Lecture series for GPU-accelerated computing
0.5% · 5k stars · Created 1 year ago · Updated 1 month ago
Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
0.3% · 6k stars · Created 8 months ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab
Fast, memory-efficient attention implementation
0.6% · 20k stars · Created 3 years ago · Updated 19 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy
LLM training in pure C/CUDA, no PyTorch needed
0.2% · 28k stars · Created 1 year ago · Updated 4 months ago