HazyResearch: CUDA kernel framework for fast deep learning primitives
Top 16.6% on SourcePulse
ThunderKittens is a C++ framework for writing high-performance deep learning kernels in CUDA, aimed at developers who need to optimize low-level GPU operations. It simplifies writing efficient kernels by abstracting complex hardware features such as tensor cores and shared memory, while achieving performance comparable to expert-tuned hand-written kernels.
How It Works
ThunderKittens is built around operating on small, fixed-size tiles of data, typically 16x16, which map naturally onto modern GPU hardware. It exposes primitives for managing data in registers and shared memory, with explicit control over layouts and data types. The framework supports asynchronous operations, overlapping work across workers, and direct access to hardware features such as Tensor Cores (WGMMA) and the Tensor Memory Accelerator (TMA) for optimized loads and stores, aiming to maximize GPU utilization.
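To make the tile model concrete, the sketch below shows the 16x16 tiling pattern in plain CUDA: staging tiles through shared memory and accumulating per-thread, the kind of work ThunderKittens wraps behind typed tile objects. It is an illustrative sketch only; it does not use the kittens API, and the kernel name, tile constant, and layouts are assumptions for demonstration.

```cuda
// Plain-CUDA illustration of the 16x16 tile pattern ThunderKittens abstracts.
// This does not use the kittens API; names and layouts are illustrative.
#include <cuda_runtime.h>

constexpr int TILE = 16;  // ThunderKittens' basic tile granularity

// One thread block (16x16 threads) computes one 16x16 output tile:
// C[16x16] = A[16xK] * B[Kx16], with K a multiple of 16.
__global__ void tile_matmul(const float* A, const float* B, float* C, int K) {
    __shared__ float a_tile[TILE][TILE];  // tiles staged in shared memory,
    __shared__ float b_tile[TILE][TILE];  // which kittens manages via shared-tile types

    int row = threadIdx.y, col = threadIdx.x;
    float acc = 0.0f;  // per-thread accumulator (a fragment of a register tile)

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative tile loads (what a kittens load primitive would do).
        a_tile[row][col] = A[row * K + (k0 + col)];
        b_tile[row][col] = B[(k0 + row) * TILE + col];
        __syncthreads();

        // The inner product a tensor-core MMA instruction performs on whole tiles.
        for (int k = 0; k < TILE; ++k)
            acc += a_tile[row][k] * b_tile[k][col];
        __syncthreads();
    }
    C[row * TILE + col] = acc;  // tile store back to global memory
}
```

In ThunderKittens, each of these steps maps to an operation on register- and shared-memory tile objects, with asynchronous TMA loads and WGMMA replacing the synchronous copies and the scalar inner loop on supported hardware.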
Quick Start & Requirements
ThunderKittens is header-only: add the repository's include directory to your build and include kittens.cuh in your kernel source. For PyTorch bindings, cd kernels/example_bind and run python setup.py install after setting environment variables via source env.src.
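For the PyTorch route, the binding layer is a standard C++ extension. The sketch below is a hypothetical, minimal illustration of how a launcher built against kittens.cuh could be exposed to Python; the function and module names are placeholders, not the actual symbols in kernels/example_bind.

```cpp
// Hypothetical PyTorch binding sketch; symbol names are placeholders,
// not the actual contents of kernels/example_bind.
#include <torch/extension.h>

// Launcher implemented in a .cu file that includes kittens.cuh.
void my_kernel_launcher(const float* in, float* out, int n);

torch::Tensor my_op(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.scalar_type() == torch::kFloat32, "input must be float32");
    auto output = torch::empty_like(input);
    my_kernel_launcher(input.data_ptr<float>(), output.data_ptr<float>(),
                       static_cast<int>(input.numel()));
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_op", &my_op, "Example op backed by a ThunderKittens kernel");
}
```

Once installed via setup.py, the op is importable from Python like any other extension module.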
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats