tilelang-ascend by tile-ai

High-performance AI kernel development for Huawei Ascend NPUs

Created 7 months ago
272 stars

Top 94.7% on SourcePulse


Summary

TileLang-Ascend is a specialized DSL built on TileLang and TVM that generates high-performance AI compute kernels for Huawei Ascend NPU architectures. It abstracts low-level hardware complexity while retaining fine-grained control for optimization, enabling AI researchers and engineers to reach state-of-the-art performance on Ascend processors.

How It Works

Leveraging a Pythonic syntax and TVM compiler infrastructure, TileLang-Ascend translates high-level kernel descriptions into optimized code for Ascend NPUs, supporting Ascend C & PTO and AscendNPU IR backends. The DSL facilitates efficient implementation of core AI operations like GEMM and attention, allowing developers to focus on algorithmic innovation rather than intricate hardware-specific programming.
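
As an illustration, the sketch below shows a tiled GEMM written in the style of the upstream TileLang API (T.Kernel, T.alloc_shared, T.alloc_fragment, T.Pipelined, T.copy, T.gemm). This is a minimal sketch assuming upstream semantics; Ascend-specific memory scopes, block mapping, and backend flags may differ, and the block sizes are illustrative.

    import tilelang
    import tilelang.language as T

    @tilelang.jit(out_idx=[2])  # treat buffer C as the kernel output (upstream convention)
    def matmul(M, N, K, block_M=128, block_N=128, block_K=32):
        @T.prim_func
        def main(
            A: T.Tensor((M, K), "float16"),
            B: T.Tensor((K, N), "float16"),
            C: T.Tensor((M, N), "float16"),
        ):
            # One program instance per (block_N, block_M) output tile
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M)) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), "float16")
                B_shared = T.alloc_shared((block_K, block_N), "float16")
                C_local = T.alloc_fragment((block_M, block_N), "float")
                T.clear(C_local)
                # Software-pipelined reduction over K; copies overlap with compute
                for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                    T.copy(A[by * block_M, ko * block_K], A_shared)
                    T.copy(B[ko * block_K, bx * block_N], B_shared)
                    T.gemm(A_shared, B_shared, C_local)
                T.copy(C_local, C[by * block_M, bx * block_N])
        return main

In upstream TileLang, matmul(1024, 1024, 1024) returns a JIT-compiled kernel that can be called directly on framework tensors; the torch_tl_ascend examples presumably follow the same pattern on NPU tensors.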

Quick Start & Requirements

  • Installation: pip install tilelang-*.whl (recommended); alternatively, build from source via ./build_wheel_ascend.sh or install_ascend.sh.
  • Prerequisites: an Ascend environment with CANN (>= 8.3.RC1) and torch-npu (>= 2.6.0.RC1), plus the CANN environment variables set; a quick sanity check follows this list.
  • Hardware: Tested on Huawei Ascend A2 and A3 NPUs.
  • Documentation: Programming Guide, Video Course Series.
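
The snippet below is a hedged sanity check for the prerequisites, assuming the standard torch-npu setup in which importing torch_npu registers an "npu" device with PyTorch:

    import torch
    import torch_npu  # Ascend adapter; registers the "npu" device (torch-npu >= 2.6.0.RC1)

    print(torch.__version__)          # expect a torch build matching torch-npu 2.6.x
    print(torch.npu.is_available())   # True only if CANN (>= 8.3.RC1) and the driver are set up
    x = torch.ones(2, 2, device="npu")
    print(x + x)                      # runs on the Ascend NPU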

Highlighted Details

  • Supports a wide range of operators: GEMM, Batch GEMM, Elementwise Operations, Flash Attention, Sparse Flash Attention, Softmax, Normalization, Activation Functions, Reduce, Sort, Convolution, and Cross Entropy Loss.
  • Features advanced optimization techniques: automatic synchronization, buffer reuse, software pipelining (T.Pipelined), automatic vectorization (T.Parallel; see the sketch after this list), explicit scope management, and automatic workspace allocation.
  • Includes examples for PyTorch integration (torch_tl_ascend) and graph-level optimization (ACLGraph).
  • Recent updates include DeepSeek V4 kernels and Flash Attention optimization guides.
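
To make the T.Parallel item above concrete, here is a minimal elementwise-add sketch in upstream TileLang style; the shapes are hypothetical and the exact Ascend lowering of the parallel loop is an assumption:

    import tilelang.language as T

    M, N, block_M = 1024, 1024, 128  # hypothetical sizes for illustration

    @T.prim_func
    def add(
        A: T.Tensor((M, N), "float16"),
        B: T.Tensor((M, N), "float16"),
        C: T.Tensor((M, N), "float16"),
    ):
        # One program instance per block_M rows
        with T.Kernel(T.ceildiv(M, block_M)) as bx:
            # T.Parallel marks the loop nest for automatic vectorization
            for i, j in T.Parallel(block_M, N):
                C[bx * block_M + i, j] = A[bx * block_M + i, j] + B[bx * block_M + i, j]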

Maintenance & Community

Open-sourced on September 29, 2025. Acknowledges support from Huawei and Peking University. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

The README does not specify a software license, so compatibility with commercial use or closed-source linking cannot be determined without clarification from the maintainers.

Limitations & Caveats

Primarily tested on Ascend A2 and A3 NPUs. The T.Pipelined feature disallows nested loops. The absence of a specified license is a significant adoption blocker.

Health Check

  • Last Commit: 7 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 205
  • Issues (30d): 23
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab (0.4% on SourcePulse, 24k stars): Fast, memory-efficient attention implementation. Created 3 years ago; updated 9 hours ago.