tilelang-ascend by tile-ai

High-performance AI kernel development for Huawei Ascend NPUs

Created 7 months ago
272 stars

Top 94.7% on SourcePulse


Summary

TileLang-Ascend is a specialized DSL built on TileLang and TVM that generates high-performance AI compute kernels for Huawei Ascend NPU architectures. It abstracts low-level hardware complexity while retaining fine-grained control for optimization, enabling AI researchers and engineers to reach state-of-the-art performance on Ascend processors.

How It Works

Leveraging a Pythonic syntax and TVM compiler infrastructure, TileLang-Ascend translates high-level kernel descriptions into optimized code for Ascend NPUs, supporting Ascend C & PTO and AscendNPU IR backends. The DSL facilitates efficient implementation of core AI operations like GEMM and attention, allowing developers to focus on algorithmic innovation rather than intricate hardware-specific programming.
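
As an illustration, the sketch below shows a tiled GEMM written in the style of the upstream TileLang API (T.Kernel, T.alloc_shared, T.alloc_fragment, T.Pipelined, T.copy, T.gemm). This is a minimal sketch assuming upstream semantics; Ascend-specific memory scopes, block mapping, and backend flags may differ, and the block sizes are illustrative.

    import tilelang
    import tilelang.language as T

    @tilelang.jit(out_idx=[2])  # treat buffer C as the kernel output (upstream convention)
    def matmul(M, N, K, block_M=128, block_N=128, block_K=32):
        @T.prim_func
        def main(
            A: T.Tensor((M, K), "float16"),
            B: T.Tensor((K, N), "float16"),
            C: T.Tensor((M, N), "float16"),
        ):
            # One program instance per (block_N, block_M) output tile
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M)) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), "float16")
                B_shared = T.alloc_shared((block_K, block_N), "float16")
                C_local = T.alloc_fragment((block_M, block_N), "float")
                T.clear(C_local)
                # Software-pipelined reduction over K; copies overlap with compute
                for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                    T.copy(A[by * block_M, ko * block_K], A_shared)
                    T.copy(B[ko * block_K, bx * block_N], B_shared)
                    T.gemm(A_shared, B_shared, C_local)
                T.copy(C_local, C[by * block_M, bx * block_N])
        return main

In upstream TileLang, matmul(1024, 1024, 1024) returns a JIT-compiled kernel that can be called directly on framework tensors; the torch_tl_ascend examples presumably follow the same pattern on NPU tensors.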

Quick Start & Requirements

  • Installation: pip install tilelang-*.whl (recommended); alternatively, build from source via ./build_wheel_ascend.sh or install_ascend.sh.
  • Prerequisites: an Ascend environment with CANN (>= 8.3.RC1) and torch-npu (>= 2.6.0.RC1), plus the CANN environment variables set; a quick sanity check follows this list.
  • Hardware: Tested on Huawei Ascend A2 and A3 NPUs.
  • Documentation: Programming Guide, Video Course Series.
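
The snippet below is a hedged sanity check for the prerequisites, assuming the standard torch-npu setup in which importing torch_npu registers an "npu" device with PyTorch:

    import torch
    import torch_npu  # Ascend adapter; registers the "npu" device (torch-npu >= 2.6.0.RC1)

    print(torch.__version__)          # expect a torch build matching torch-npu 2.6.x
    print(torch.npu.is_available())   # True only if CANN (>= 8.3.RC1) and the driver are set up
    x = torch.ones(2, 2, device="npu")
    print(x + x)                      # runs on the Ascend NPU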

Highlighted Details

  • Supports a wide range of operators: GEMM, Batch GEMM, Elementwise Operations, Flash Attention, Sparse Flash Attention, Softmax, Normalization, Activation Functions, Reduce, Sort, Convolution, and Cross Entropy Loss.
  • Features advanced optimization techniques: automatic synchronization, buffer reuse, software pipelining (T.Pipelined), automatic vectorization (T.Parallel; see the sketch after this list), explicit scope management, and automatic workspace allocation.
  • Includes examples for PyTorch integration (torch_tl_ascend) and graph-level optimization (ACLGraph).
  • Recent updates include DeepSeek V4 kernels and Flash Attention optimization guides.
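
To make the T.Parallel item above concrete, here is a minimal elementwise-add sketch in upstream TileLang style; the shapes are hypothetical and the exact Ascend lowering of the parallel loop is an assumption:

    import tilelang.language as T

    M, N, block_M = 1024, 1024, 128  # hypothetical sizes for illustration

    @T.prim_func
    def add(
        A: T.Tensor((M, N), "float16"),
        B: T.Tensor((M, N), "float16"),
        C: T.Tensor((M, N), "float16"),
    ):
        # One program instance per block_M rows
        with T.Kernel(T.ceildiv(M, block_M)) as bx:
            # T.Parallel marks the loop nest for automatic vectorization
            for i, j in T.Parallel(block_M, N):
                C[bx * block_M + i, j] = A[bx * block_M + i, j] + B[bx * block_M + i, j]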

Maintenance & Community

Open-sourced on September 29, 2025. Acknowledges support from Huawei and Peking University. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

The README does not specify a software license, so compatibility with commercial use or closed-source linking cannot be determined without clarification from the maintainers.

Limitations & Caveats

Primarily tested on Ascend A2 and A3 NPUs. The T.Pipelined feature disallows nested loops. The absence of a specified license is a significant adoption blocker.

Health Check

  • Last Commit: 7 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 205
  • Issues (30d): 23
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab (0.4% on SourcePulse, 24k stars): Fast, memory-efficient attention implementation. Created 3 years ago; updated 9 hours ago.