CUDA kernel framework for fast deep learning primitives
ThunderKittens provides a C++ framework for writing high-performance deep learning kernels in CUDA, aimed at developers who need to optimize low-level GPU operations. It simplifies writing efficient kernels by abstracting complex hardware features such as tensor cores and shared-memory layouts, while achieving performance comparable to expert-tuned kernels.
How It Works
ThunderKittens is built around the principle of operating on small, fixed-size "tiles" of data, typically 16x16, matching the granularity of modern GPU tensor-core hardware. It exposes primitives for managing data in registers and shared memory, with explicit control over layouts and data types. The framework supports asynchronous operations, overlapping computation with data movement across workers, and direct access to hardware features such as Tensor Core matrix instructions (WGMMA) and the Tensor Memory Accelerator (TMA) for optimized loads and stores, aiming to maximize GPU utilization.
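The tile-centric style described above can be sketched as a tiny matrix-multiply kernel. This is a hedged illustration, not code from the repository: the identifiers (`rt_bf`, `rt_fl`, `zero`, `load`, `store`, `mma_AB`, the `kittens` namespace) follow the style of the project's examples, but exact names, template parameters, and signatures vary across ThunderKittens versions, so treat this as pseudocode in CUDA syntax.

```cuda
#include "kittens.cuh"   // ThunderKittens is header-only
using namespace kittens;

// Sketch: each warp computes one 16x16 output tile of C = A * B.
// Tile types, layouts, and load/store/mma signatures are assumptions
// based on the project's examples and may differ in your version.
__global__ void tile_mma(const bf16 *A, const bf16 *B, float *C) {
    rt_bf<16, 16> a;                          // A tile in registers (row layout)
    rt_bf<16, 16, ducks::rt_layout::col> b;   // B tile (column layout for the MMA)
    rt_fl<16, 16> c;                          // fp32 accumulator tile

    zero(c);           // initialize the accumulator
    load(a, A, 16);    // load 16x16 input tiles from global memory
    load(b, B, 16);
    mma_AB(c, a, b, c); // tensor-core matrix multiply-accumulate
    store(C, c, 16);   // write the result tile back
}
```

The point of the sketch is the programming model: the developer thinks in whole tiles and layouts, and the framework maps those operations onto tensor-core instructions and memory transactions.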
Quick Start & Requirements
ThunderKittens is header-only: add the repository to your include path and include kittens.cuh in your CUDA source. For the PyTorch bindings, set the required environment variables via source env.src, then cd kernels/example_bind and run python setup.py install.
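Put together, the PyTorch-binding install steps above look like the following, run from the repository root (paths as given in the repo):

```shell
# From the ThunderKittens repository root:
source env.src              # set environment variables used by the build
cd kernels/example_bind     # example PyTorch binding
python setup.py install     # build and install the extension
```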
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats