TileKernels by deepseek-ai

Optimized GPU kernels for LLM operations

Created 6 days ago


1,279 stars

Top 30.7% on SourcePulse

Project Summary

Summary

TileKernels provides a library of highly optimized GPU kernels for Large Language Model (LLM) operations, developed using TileLang. This project targets engineers and researchers seeking to maximize LLM training and inference performance by leveraging kernels that approach hardware limits for compute intensity and memory bandwidth. Its core benefit is state-of-the-art kernel performance delivered through an agile, Python-based development workflow.

How It Works

The project utilizes TileLang, a domain-specific language embedded in Python, for expressing and automatically optimizing high-performance GPU kernels. This approach facilitates easy migration of existing kernels and enables agile development cycles. Key architectural choices focus on pushing compute intensity and memory bandwidth utilization toward hardware ceilings. The library includes specialized kernels for Mixture of Experts (MoE) routing, advanced quantization (FP8/FP4/E5M6), batched transposes, Engram gating, and Manifold HyperConnection (mHC) operations.
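For orientation, here is a minimal tiled GEMM sketch in the TileLang style, adapted from TileLang's public examples rather than taken from this repository; constructs such as T.Kernel, T.Pipelined, T.alloc_shared, and T.gemm follow those examples and may differ across TileLang versions.

```python
import tilelang.language as T

# Illustrative tiled GEMM in the TileLang style (adapted from TileLang's public
# examples, NOT from the TileKernels repository). Exact names and the
# compilation entry point are version-dependent.
def make_matmul(M, N, K, block_M=128, block_N=128, block_K=32,
                dtype="float16", accum_dtype="float"):
    @T.prim_func
    def matmul(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One thread block per (block_M, block_N) output tile of C.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined loop over K to overlap global-to-shared copies with compute.
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return matmul
```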

Quick Start & Requirements

  • Primary Install:
    • Development: pip install -e ".[dev]"
    • Release: pip install tile-kernels
  • Prerequisites: Python 3.10+, PyTorch 2.10+, TileLang 0.1.9+, an NVIDIA SM90/SM100 GPU, CUDA Toolkit 13.1+ (a quick environment check is sketched after this list).
  • Testing: Utilizes pytest for correctness and benchmarking.
  • Documentation: No specific quick-start or documentation links are provided beyond the repository itself.
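A minimal sanity check of the prerequisites can be run with standard PyTorch APIs before installing; the snippet below is not part of TileKernels and only reports the local GPU's compute capability and the PyTorch/CUDA versions.

```python
import torch

# Quick environment check for the prerequisites above.
# SM90 corresponds to compute capability (9, 0); SM100 to (10, 0).
assert torch.cuda.is_available(), "An NVIDIA GPU and a CUDA-enabled PyTorch build are required"

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor}")   # expect sm_90 or sm_100
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA runtime bundled with PyTorch: {torch.version.cuda}")
```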

Highlighted Details

  • Mixture of Experts (MoE): Features kernels for top-k expert selection, gating, and routing.
  • Quantization: Implements per-token, per-block, and per-channel FP8/FP4/E5M6 casting, including fused SwiGLU operations (a per-token casting sketch follows this list).
  • Specialized Kernels: Includes batched transpose operations, Engram gating with fused RMSNorm and gradient reduction, and Manifold HyperConnection (mHC) kernels with Sinkhorn normalization.
  • High-Level Abstractions: Offers torch.autograd.Function wrappers that compose low-level kernels into trainable PyTorch layers (the wrapper pattern is also sketched after this list).
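To make the per-token casting idea concrete, here is a plain-PyTorch reference sketch using the FP8 E4M3 format (one scale per token row); it is an illustration only and does not reflect TileKernels' fused, TileLang-based kernels or their API.

```python
import torch

def per_token_fp8_cast(x: torch.Tensor, eps: float = 1e-12):
    """Illustrative per-token FP8 (E4M3) casting: one scale per row (token).

    Reference sketch only; the real kernels fuse this with ops such as SwiGLU
    and operate on GPU tiles rather than whole tensors.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # The per-token absolute maximum determines that row's scale.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Round-trip example: dequantize with x_fp8.float() * scale.
x = torch.randn(8, 128)
x_fp8, scale = per_token_fp8_cast(x)
print((x_fp8.float() * scale - x).abs().max())
```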
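The wrapper pattern can likewise be sketched with a self-contained example; the SwiGLU reference function below stands in for the fused low-level kernels, and all names here are hypothetical rather than TileKernels' actual API.

```python
import torch

def swiglu_ref(x, w_gate, w_up):
    # Plain-PyTorch stand-in for a fused TileLang kernel.
    return torch.nn.functional.silu(x @ w_gate.t()) * (x @ w_up.t())

class FusedSwiGLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w_gate, w_up):
        ctx.save_for_backward(x, w_gate, w_up)
        # A real implementation would call the fused forward kernel here.
        return swiglu_ref(x, w_gate, w_up)

    @staticmethod
    def backward(ctx, grad_out):
        x, w_gate, w_up = ctx.saved_tensors
        # A real implementation would call a fused backward kernel; here we
        # recompute with autograd-enabled PyTorch ops for clarity.
        with torch.enable_grad():
            x_, wg_, wu_ = (t.detach().requires_grad_(True) for t in (x, w_gate, w_up))
            y = swiglu_ref(x_, wg_, wu_)
            grads = torch.autograd.grad(y, (x_, wg_, wu_), grad_out)
        return grads

# Usage: the wrapper behaves like any differentiable PyTorch op.
x = torch.randn(4, 16, requires_grad=True)
w_gate = torch.randn(32, 16, requires_grad=True)
w_up = torch.randn(32, 16, requires_grad=True)
out = FusedSwiGLU.apply(x, w_gate, w_up)
out.sum().backward()
```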

Maintenance & Community

The project lists authors in its citation but provides no specific details regarding active maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

Released under the permissive MIT License, which permits commercial use and integration into closed-source projects.

Limitations & Caveats

The project explicitly states that current kernels "do not represent best practices" and are undergoing active improvement in code quality and documentation. Adoption requires specific, high-end NVIDIA hardware (SM90/SM100, i.e., Hopper- and Blackwell-class GPUs) and a recent CUDA toolkit version.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 5

Star History

  • 1,289 stars in the last 6 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba
  • LLM inference engine for diverse applications
  • 0.6% · 1k stars · Created 2 years ago · Updated 5 hours ago

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16
  • High-performance C++ LLM inference library
  • 0.5% · 4k stars · Created 3 years ago · Updated 5 days ago