ThunderKittens by HazyResearch

CUDA kernel framework for fast deep learning primitives

created 1 year ago
2,538 stars

Top 18.8% on sourcepulse

Project Summary

ThunderKittens provides a C++ framework for writing high-performance deep learning kernels in CUDA, targeting developers who need to optimize low-level GPU operations. It simplifies the creation of efficient kernels by abstracting complex hardware features like tensor cores and shared memory, enabling performance comparable to hand-written kernels.

How It Works

ThunderKittens is built around the principle of operating on small, fixed-size "tiles" of data, typically 16x16, which map naturally onto modern GPU hardware. It exposes primitives for managing data in registers and shared memory, with explicit support for layouts and types. The framework facilitates asynchronous operations, overlapping work across workers, and direct access to hardware features such as Tensor Cores (via asynchronous WGMMA instructions) and the Tensor Memory Accelerator (TMA) for optimized loads and stores, aiming to maximize GPU utilization.
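The tile-centric style described above can be illustrated with a short kernel sketch. This is a hypothetical example, not verified against the current ThunderKittens API: the tile type names (`rt_bf`, `rt_fl`) and the `load`/`mma_AB`/`store` calls are illustrative of the register-tile idiom, and exact signatures vary by version; consult the manual and headers for the real interface.

```cuda
// Hypothetical sketch of the ThunderKittens tile idiom (names illustrative):
// load 16x16 bf16 tiles into registers, multiply them on tensor cores,
// and store the float accumulator back to global memory.
#include "kittens.cuh"
using namespace kittens;

__global__ void tile_mma(const bf16 *A, const bf16 *B, float *C) {
    rt_bf<16, 16> a, b;   // register tiles holding the inputs
    rt_fl<16, 16> c;      // float register tile for the accumulator
    zero(c);              // initialize the accumulator to zero
    load(a, A, 16);       // load a 16x16 tile (row stride 16) from global memory
    load(b, B, 16);
    mma_AB(c, a, b, c);   // c += a @ b using tensor-core MMA
    store(C, c, 16);      // write the result tile back out
}
```

Even in this toy form, the point of the abstraction is visible: the kernel manipulates whole tiles with type- and layout-checked operations rather than hand-indexing individual threads into fragments.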

Quick Start & Requirements

  • Install: Clone the repository and include kittens.cuh. For PyTorch bindings, cd kernels/example_bind and run python setup.py install after setting environment variables via source env.src.
  • Prerequisites: CUDA 12.3+, C++20 (GCC 11+ or Clang 11+ recommended).
  • Resources: Building tests can take a few minutes.
  • Docs: ThunderKittens Manual (in-progress).
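The install steps above can be sketched as a shell session. This assumes the repository URL below (the HazyResearch GitHub organization) and that CUDA 12.3+ and a C++20 toolchain are already installed:

```shell
# Clone the repo; kernels include the single header kittens.cuh.
git clone https://github.com/HazyResearch/ThunderKittens.git
cd ThunderKittens

# Optional: build and install the example PyTorch bindings.
cd kernels/example_bind
source env.src            # sets the required environment variables
python setup.py install   # compiles and installs the extension
```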

Highlighted Details

  • Achieves ~155 TFLOPs (93% of theoretical max) on an RTX 4090 with a FlashAttention-2 kernel.
  • Supports FP8, asynchronous WGMMA calls, TMA loads/stores, and shared memory bank conflict avoidance.
  • Emphasizes compile-time layout and type checking for safety.
  • Designed for extensibility, allowing integration with custom C++ code.

Maintenance & Community

  • Active development with recent updates (Nov 2024) including FP8 support and improved PyTorch bindings.
  • Community engagement via the GPU Mode Discord.
  • Blog posts and a research paper are available for deeper understanding.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Primarily focused on CUDA; support for MPS and ROCm is planned but not yet implemented.
  • The manual is still in progress, and some "sharp edges" may require careful reading of the source code.
  • Requires a modern C++ compiler and CUDA toolkit, potentially necessitating environment setup.
Health Check

  • Last commit: 3 days ago
  • Responsiveness: ~1 week
  • Pull requests (30d): 6
  • Issues (30d): 3
  • Star history: 244 stars in the last 90 days

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

Explore Similar Projects

gpu.cpp by AnswerDotAI

  • C++ library for portable GPU computation using WebGPU
  • 4k stars, top 0.2% on sourcepulse
  • created 1 year ago, updated 2 weeks ago