TileKernels by deepseek-ai

Optimized GPU kernels for LLM operations

Created 6 days ago


1,279 stars

Top 30.7% on SourcePulse

Project Summary

Summary

TileKernels provides a library of highly optimized GPU kernels for Large Language Model (LLM) operations, developed using TileLang. This project targets engineers and researchers seeking to maximize LLM training and inference performance by leveraging kernels that approach hardware limits for compute intensity and memory bandwidth. Its core benefit is state-of-the-art kernel performance delivered through an agile, Python-based development workflow.

How It Works

The project utilizes TileLang, a domain-specific language embedded in Python, for expressing and automatically optimizing high-performance GPU kernels. This approach facilitates easy migration of existing kernels and enables agile development cycles. Key architectural choices focus on pushing compute intensity and memory bandwidth utilization toward hardware ceilings. The library includes specialized kernels for Mixture of Experts (MoE) routing, advanced quantization (FP8/FP4/E5M6), batched transposes, Engram gating, and Manifold HyperConnection (mHC) operations.
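For orientation, here is a minimal tiled GEMM sketch in the TileLang style, adapted from TileLang's public examples rather than taken from this repository; constructs such as T.Kernel, T.Pipelined, T.alloc_shared, and T.gemm follow those examples and may differ across TileLang versions.

```python
import tilelang.language as T

# Illustrative tiled GEMM in the TileLang style (adapted from TileLang's public
# examples, NOT from the TileKernels repository). Exact names and the
# compilation entry point are version-dependent.
def make_matmul(M, N, K, block_M=128, block_N=128, block_K=32,
                dtype="float16", accum_dtype="float"):
    @T.prim_func
    def matmul(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One thread block per (block_M, block_N) output tile of C.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined loop over K to overlap global-to-shared copies with compute.
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return matmul
```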

Quick Start & Requirements

  • Primary Install:
    • Development: pip install -e ".[dev]"
    • Release: pip install tile-kernels
  • Prerequisites: Python 3.10+, PyTorch 2.10+, TileLang 0.1.9+, an NVIDIA SM90/SM100 GPU, CUDA Toolkit 13.1+ (a quick environment check is sketched after this list).
  • Testing: Utilizes pytest for correctness and benchmarking.
  • Documentation: No specific quick-start or documentation links are provided beyond the repository itself.
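A minimal sanity check of the prerequisites can be run with standard PyTorch APIs before installing; the snippet below is not part of TileKernels and only reports the local GPU's compute capability and the PyTorch/CUDA versions.

```python
import torch

# Quick environment check for the prerequisites above.
# SM90 corresponds to compute capability (9, 0); SM100 to (10, 0).
assert torch.cuda.is_available(), "An NVIDIA GPU and a CUDA-enabled PyTorch build are required"

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor}")   # expect sm_90 or sm_100
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA runtime bundled with PyTorch: {torch.version.cuda}")
```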

Highlighted Details

  • Mixture of Experts (MoE): Features kernels for top-k expert selection, gating, and routing.
  • Quantization: Implements per-token, per-block, and per-channel FP8/FP4/E5M6 casting, including fused SwiGLU operations (a per-token casting sketch follows this list).
  • Specialized Kernels: Includes batched transpose operations, Engram gating with fused RMSNorm and gradient reduction, and Manifold HyperConnection (mHC) kernels with Sinkhorn normalization.
  • High-Level Abstractions: Offers torch.autograd.Function wrappers that compose low-level kernels into trainable PyTorch layers (the wrapper pattern is also sketched after this list).
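To make the per-token casting idea concrete, here is a plain-PyTorch reference sketch using the FP8 E4M3 format (one scale per token row); it is an illustration only and does not reflect TileKernels' fused, TileLang-based kernels or their API.

```python
import torch

def per_token_fp8_cast(x: torch.Tensor, eps: float = 1e-12):
    """Illustrative per-token FP8 (E4M3) casting: one scale per row (token).

    Reference sketch only; the real kernels fuse this with ops such as SwiGLU
    and operate on GPU tiles rather than whole tensors.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # The per-token absolute maximum determines that row's scale.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Round-trip example: dequantize with x_fp8.float() * scale.
x = torch.randn(8, 128)
x_fp8, scale = per_token_fp8_cast(x)
print((x_fp8.float() * scale - x).abs().max())
```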
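The wrapper pattern can likewise be sketched with a self-contained example; the SwiGLU reference function below stands in for the fused low-level kernels, and all names here are hypothetical rather than TileKernels' actual API.

```python
import torch

def swiglu_ref(x, w_gate, w_up):
    # Plain-PyTorch stand-in for a fused TileLang kernel.
    return torch.nn.functional.silu(x @ w_gate.t()) * (x @ w_up.t())

class FusedSwiGLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w_gate, w_up):
        ctx.save_for_backward(x, w_gate, w_up)
        # A real implementation would call the fused forward kernel here.
        return swiglu_ref(x, w_gate, w_up)

    @staticmethod
    def backward(ctx, grad_out):
        x, w_gate, w_up = ctx.saved_tensors
        # A real implementation would call a fused backward kernel; here we
        # recompute with autograd-enabled PyTorch ops for clarity.
        with torch.enable_grad():
            x_, wg_, wu_ = (t.detach().requires_grad_(True) for t in (x, w_gate, w_up))
            y = swiglu_ref(x_, wg_, wu_)
            grads = torch.autograd.grad(y, (x_, wg_, wu_), grad_out)
        return grads

# Usage: the wrapper behaves like any differentiable PyTorch op.
x = torch.randn(4, 16, requires_grad=True)
w_gate = torch.randn(32, 16, requires_grad=True)
w_up = torch.randn(32, 16, requires_grad=True)
out = FusedSwiGLU.apply(x, w_gate, w_up)
out.sum().backward()
```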

Maintenance & Community

The project lists authors in its citation but provides no specific details regarding active maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

Released under the permissive MIT License, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

The project explicitly states that current kernels "do not represent best practices" and are undergoing active improvement in code quality and documentation. Adoption requires specific, high-end NVIDIA hardware (SM90/SM100, i.e., Hopper- and Blackwell-class GPUs) and a recent CUDA toolkit version.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 5

Star History

  • 1,289 stars in the last 6 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba
  • LLM inference engine for diverse applications
  • 0.6% · 1k stars · Created 2 years ago · Updated 5 hours ago

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16
  • High-performance C++ LLM inference library
  • 0.5% · 4k stars · Created 3 years ago · Updated 5 days ago