tilelang by tile-ai

DSL for high-performance GPU/CPU kernel development (GEMM, attention, etc.)

Created 1 year ago
4,584 stars

Top 10.7% on SourcePulse

Project Summary

Tile Language (tile-lang) is a domain-specific language (DSL) built on TVM for developing high-performance GPU and CPU kernels. It targets AI researchers and engineers seeking to optimize operations like GEMM, FlashAttention, and MLA decoding without sacrificing productivity, offering Pythonic syntax for low-level control.

How It Works

TileLang leverages TVM's compiler infrastructure to translate Python-like DSL code into optimized low-level kernels. It allows explicit control over tiling, data layout, pipelining, and parallelization, enabling developers to fine-tune performance for specific hardware architectures. This approach aims to bridge the gap between high-level productivity and the intricate optimizations required for state-of-the-art AI workloads.
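To make this concrete, here is a tiled GEMM kernel sketched from the project's documented example. The constructs shown (T.Kernel, T.alloc_shared, T.alloc_fragment, T.Pipelined, T.copy, T.gemm) follow the published API, but exact signatures and defaults may differ across versions, so treat this as an illustrative sketch rather than canonical code.

    import tilelang
    import tilelang.language as T

    @tilelang.jit(out_idx=[-1])  # treat the last argument (C) as the returned output
    def matmul(M, N, K, block_M, block_N, block_K,
               dtype="float16", accum_dtype="float"):
        @T.prim_func
        def main(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
        ):
            # One thread block computes one block_M x block_N tile of C.
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                          threads=128) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), dtype)
                B_shared = T.alloc_shared((block_K, block_N), dtype)
                C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
                T.clear(C_local)
                # Software-pipelined reduction over K tiles (explicit staging).
                for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                    T.copy(A[by * block_M, ko * block_K], A_shared)
                    T.copy(B[ko * block_K, bx * block_N], B_shared)
                    T.gemm(A_shared, B_shared, C_local)  # tile-level matmul-accumulate
                T.copy(C_local, C[by * block_M, bx * block_N])
        return main

The explicit shared-memory staging, fragment accumulator, and pipelined loop are exactly the knobs the paragraph above describes: each is visible and tunable rather than hidden behind the compiler.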

Quick Start & Requirements

  • Install: pip install tilelang or pip install git+https://github.com/tile-ai/tilelang
  • Prerequisites: Python 3.x; for source builds: GCC, cmake, python3-setuptools, libtinfo-dev, zlib1g-dev, build-essential, libedit-dev, libxml2-dev. CUDA 12.1+ for NVIDIA GPU targets.
  • Setup: Installation via pip is quick; building from source or using nightly builds requires more setup. A minimal smoke test follows this list.
  • Docs: https://tile-ai.github.io/tilelang/
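A minimal smoke test, assuming the matmul sketch above, PyTorch, and a CUDA-capable GPU (the tolerances are arbitrary choices to absorb float16 accumulation differences):

    import torch

    # Compile for fixed shapes; block sizes (128, 128, 32) are a tunable choice.
    kernel = matmul(1024, 1024, 1024, 128, 128, 32)
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = kernel(a, b)  # C is allocated and returned because of out_idx=[-1]
    torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-2)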

Highlighted Details

  • Achieves performance parity with hand-optimized kernels for FlashMLA on AMD MI300X and MLA Decoding on H100.
  • Supports WebGPU codegen.
  • Includes debug tools such as T.print for printing values from device code, plus a memory layout plotter (see the sketch after this list).
  • Tested on NVIDIA (H100, A100, V100, RTX 4090/3090/A6000) and AMD (MI250, MI300X) GPUs.
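As a taste of the debugging workflow, the sketch below calls T.print from device code to dump a shared buffer. This is a hypothetical minimal example: T.print is documented, but the exact semantics for buffers versus scalars may vary by version.

    import tilelang
    import tilelang.language as T

    def debug_copy(N=16, dtype="float16"):
        @T.prim_func
        def main(A: T.Tensor((N,), dtype)):
            # Single block; stage the input into shared memory and print it.
            with T.Kernel(1, threads=128) as bx:
                buf = T.alloc_shared((N,), dtype)
                T.copy(A, buf)
                T.print(buf)  # emits the buffer contents from the device
        return main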

Maintenance & Community

  • Active development with recent updates including AMD MI300X support and MLA decoding.
  • Discord community available for discussion and support.
  • Used in projects like BitBLAS and AttentionEngine.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Nightly builds may be less stable.
  • For T.gemm, the README notes a "dispatch to the cute/hip on Nvidia/AMD GPUs", meaning the actual GEMM execution relies on external libraries (CuTe/CUTLASS on NVIDIA, HIP on AMD); this may introduce additional dependencies or compatibility considerations.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 179
  • Issues (30d): 94

Star History

450 stars in the last 30 days

Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

Explore Similar Projects

lectures by gpu-mode

Lecture series for GPU-accelerated computing
Top 0.8% · 6k stars · Created 2 years ago · Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

Efficient CUDA kernels for MLA decoding
Top 0.1% · 12k stars · Created 10 months ago · Updated 3 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

Fast, memory-efficient attention implementation
Top 0.6% · 22k stars · Created 3 years ago · Updated 1 day ago