FlashQLA by QwenLM

Accelerate AI workloads with high-performance linear attention kernels

Created 1 month ago
515 stars

Top 60.4% on SourcePulse

Project Summary

FlashQLA is a high-performance linear attention kernel library that accelerates chunked prefill for Gated DeltaNet (GDN). It targets researchers and engineers working on large language models, particularly pretraining and edge-side agentic inference, and reports 2-3× forward and 2× backward speedups over the FLA Triton kernel on NVIDIA Hopper GPUs. The kernels are written in TileLang.

How It Works

FlashQLA uses TileLang to implement a fused GDN Chunked Prefill kernel, applying operator fusion and performance optimizations to both the forward and backward passes. Three techniques stand out. Gate-driven automatic intra-card context parallelism exploits the exponential decay of the GDN gate to raise GPU SM utilization. Hardware-friendly algebraic reformulations reduce computational overhead without sacrificing numerical precision. Fused, warp-specialized TileLang kernels overlap data movement with computation.
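For orientation, the recurrence that a GDN chunked-prefill kernel evaluates can be written token by token. The sketch below assumes the gated delta rule as commonly published for Gated DeltaNet, S_t = a_t · S_{t-1}(I − b_t k_t k_tᵀ) + b_t v_t k_tᵀ with output o_t = S_t q_t. The function name `gated_delta_rule` and its argument layout are illustrative only. This is not FlashQLA's API or kernel code, and a real kernel processes tokens in chunks rather than one at a time.

```python
# Reference (unoptimized) gated delta rule in pure Python.
# Assumption: the standard Gated DeltaNet update, not FlashQLA's implementation.

def gated_delta_rule(q, k, v, alpha, beta):
    """Run the recurrence over T tokens.

    q, k:  lists of T vectors of dimension d_k
    v:     list of T vectors of dimension d_v
    alpha: list of T decay gates in [0, 1]
    beta:  list of T write strengths in [0, 1]
    Returns a list of T output vectors of dimension d_v.
    """
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_k for _ in range(d_v)]  # state matrix, shape (d_v, d_k)
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # What the current state already stores for key k_t: S @ k_t
        Sk = [sum(S[i][j] * k_t[j] for j in range(d_k)) for i in range(d_v)]
        # S <- a_t * (S - b_t * (S k_t) k_t^T) + b_t * v_t k_t^T
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] = a_t * (S[i][j] - b_t * Sk[i] * k_t[j]) + b_t * v_t[i] * k_t[j]
        # Read out with the query: o_t = S @ q_t
        outputs.append([sum(S[i][j] * q_t[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


# Usage: writing twice to the same key with beta = 1 replaces the stored value.
q = [[1.0, 0.0], [1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0, 3.0], [5.0, 7.0]]
print(gated_delta_rule(q, k, v, alpha=[1.0, 0.5], beta=[1.0, 1.0]))
```

The sequential loop is the part that chunked kernels restructure: each step depends on the previous state, so a naive implementation cannot use the GPU's parallelism across the sequence dimension.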

Quick Start & Requirements

  • Requirements: NVIDIA SM90 or above, CUDA 12.8 or above, PyTorch 2.8 or above.
  • Installation:
    git clone https://github.com/QwenLM/FlashQLA.git
    cd FlashQLA
    pip install -v .
    
  • Links: Blog (the blog link in the README is a placeholder and may not lead to content).
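Before installing, it can help to confirm that the local environment meets the stated minimums. The following is a minimal sketch, not part of FlashQLA: the helper names `parse_version` and `check_requirements` are hypothetical, and the thresholds are copied from the requirements list above. The PyTorch calls used in the guarded section (`torch.cuda.get_device_capability`, `torch.version.cuda`, `torch.__version__`) are standard PyTorch attributes.

```python
import re

# Minimums quoted in the requirements list (assumed to be inclusive lower bounds).
MIN_SM = (9, 0)      # SM90
MIN_CUDA = (12, 8)   # CUDA 12.8
MIN_TORCH = (2, 8)   # PyTorch 2.8


def parse_version(text):
    """Extract (major, minor) from strings such as '2.8.0+cu128' or '12.8'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements(sm, cuda_version, torch_version):
    """Return a list of problems; an empty list means every minimum is met."""
    problems = []
    if tuple(sm) < MIN_SM:
        problems.append(f"GPU compute capability {tuple(sm)} is below SM90")
    if parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is below 12.8")
    if parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is below 2.8")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if not torch.cuda.is_available():
            print("No CUDA device is visible to PyTorch")
        else:
            issues = check_requirements(
                torch.cuda.get_device_capability(),  # e.g. (9, 0) on Hopper
                torch.version.cuda or "0.0",         # None on non-CUDA builds
                torch.__version__,
            )
            print(issues or "Environment meets the stated minimums")
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.10" would sort before "2.8".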

Highlighted Details

  • Achieves 2-3× forward speedup and 2× backward speedup over the FLA Triton kernel on NVIDIA Hopper.
  • Efficiency gains are particularly pronounced in pretraining scenarios and edge-side agentic inference.
  • Features gate-driven automatic intra-card context parallelism for improved GPU SM utilization.
  • Employs hardware-friendly algebraic reformulation to minimize Tensor Core, CUDA Core, and SFU overhead.
  • Uses TileLang for fused warp-specialized kernels with manual warpgroup specialization.
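The decay property that the context-parallelism bullet refers to can be illustrated numerically. If each token multiplies the running state by a gate value in (0, 1), the influence of earlier tokens shrinks as the product of those gates, and after enough tokens it falls below any chosen tolerance. The sketch below only demonstrates that property. The function name `decay_horizon` and the tolerance `eps` are assumptions for illustration, and the summary does not describe how FlashQLA actually selects split points or schedules work across SMs.

```python
import math


def decay_horizon(alphas, eps=1e-3):
    """Return the number of tokens after which the product of gates drops
    below eps, or None if it never does within the given sequence.

    Sums log-gates instead of multiplying to avoid floating-point underflow.
    """
    log_eps = math.log(eps)
    log_prod = 0.0
    for n, a in enumerate(alphas, start=1):
        if a <= 0.0:
            return n  # a zero gate erases all earlier state immediately
        log_prod += math.log(a)
        if log_prod < log_eps:
            return n
    return None


# Usage: with a constant gate of 0.9, find how far back the state "remembers"
# at a tolerance of 1e-3.
print(decay_horizon([0.9] * 100))
```

Under this reading, tokens further back than the horizon contribute almost nothing to the current state, which is the kind of structure that could let one long sequence be split into segments processed concurrently on a single GPU.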

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README. The project is associated with QwenLM.

Licensing & Compatibility

FlashQLA is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library has stringent hardware and software requirements, mandating NVIDIA SM90+ GPUs and recent versions of CUDA and PyTorch. These prerequisites may limit adoption for users with older hardware or different development environments.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 10
  • Issues (30d): 7
  • Star history: 516 stars in the last 30 days
