FlashQLA by QwenLM

Accelerate AI workloads with high-performance linear attention kernels

Created 2 months ago

584 stars

Top 54.8% on SourcePulse

Project Summary

FlashQLA is a high-performance linear attention kernel library that accelerates GDN Chunked Prefill. It targets researchers and engineers working on large language models, particularly pretraining and edge-side agentic inference. On NVIDIA Hopper it reports a 2-3× forward and 2× backward speedup over the FLA Triton kernel. The kernels are written in TileLang.

How It Works

FlashQLA uses TileLang to implement an optimized GDN Chunked Prefill kernel, applying operator fusion and performance optimizations to both the forward and backward passes. It relies on three techniques. Gate-driven automatic intra-card context parallelism raises GPU SM utilization by exploiting the exponential decay of the GDN gate. Hardware-friendly algebraic reformulations reduce computational overhead without sacrificing numerical precision. Fused, warp-specialized TileLang kernels overlap data movement with computation.
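
To make concrete what a chunked prefill of this kind computes, here is a minimal pure-Python reference, assuming GDN refers to the gated delta rule from the Gated Delta Network literature (S_t = alpha_t * S_{t-1} * (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T, o_t = S_t q_t). The function names are hypothetical and nothing here is taken from the FlashQLA source. It is a slow token-by-token sketch in which only the state S crosses chunk boundaries; an optimized kernel would parallelize the work inside each chunk.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of the (assumed) gated delta rule on a d_v x d_k state S."""
    d_v, d_k = len(S), len(S[0])
    # Value currently stored under key k, after the gate decays the state.
    pred = [alpha * sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the state, then write the correction (v - pred) along k.
    # Algebraically equal to alpha * S @ (I - beta k k^T) + beta v k^T.
    S_new = [
        [alpha * S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o


def gated_delta_prefill(qs, ks, vs, alphas, betas, chunk_size):
    """Process the prompt chunk by chunk, carrying only S between chunks."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for start in range(0, len(qs), chunk_size):
        stop = min(start + chunk_size, len(qs))
        for t in range(start, stop):  # sequential here; a kernel would fuse this
            S, o = gated_delta_step(S, qs[t], ks[t], vs[t], alphas[t], betas[t])
            outputs.append(o)
    return outputs, S


if __name__ == "__main__":
    ks = [[1.0, 0.0], [1.0, 0.0]]   # the same unit-norm key twice
    vs = [[2.0, 3.0], [5.0, 7.0]]
    outs, _ = gated_delta_prefill(ks, ks, vs, [1.0, 0.5], [1.0, 1.0], chunk_size=1)
    print(outs)
```

In the small example, the second write reuses the first key, so with beta = 1 the delta rule replaces the stored value instead of adding to it, and the result does not depend on the chunk size.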

Quick Start & Requirements

Requirements: NVIDIA SM90 or above, CUDA 12.8 or above, PyTorch 2.8 or above.
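
Before installing, it can help to confirm that the local environment meets these minimums. The following is a small sketch, not part of FlashQLA: the helper names are hypothetical, and it uses torch.version.cuda (the CUDA version PyTorch was built against) as a stand-in for the installed CUDA toolkit, which is an assumption that may not hold on every setup.

```python
import re

MIN_SM = (9, 0)      # SM90, i.e. Hopper
MIN_CUDA = (12, 8)
MIN_TORCH = (2, 8)


def parse_version(text):
    """Return (major, minor) from strings such as '2.8.0+cu128' or '12.8'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_requirements(torch_version, cuda_version, sm):
    """List every stated requirement that the given environment fails."""
    problems = []
    if parse_version(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is older than 2.8")
    if cuda_version is None or parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.8")
    if sm is None or tuple(sm) < MIN_SM:
        problems.append(f"compute capability {sm} is below SM90")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # get_device_capability() returns (major, minor) for the current GPU.
        sm = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
        issues = unmet_requirements(str(torch.__version__), torch.version.cuda, sm)
        print("\n".join(issues) if issues else "Environment meets the stated minimums")
```

Keeping the comparison logic separate from the PyTorch calls means it can be exercised on machines without a GPU.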

Installation:

git clone https://github.com/QwenLM/FlashQLA.git
cd FlashQLA
pip install -v .

Links: Blog (the blog link in the README is a placeholder and may not lead to content).

Highlighted Details

Achieves 2-3× forward speedup and 2× backward speedup over the FLA Triton kernel on NVIDIA Hopper.
Efficiency gains are particularly pronounced in pretraining scenarios and edge-side agentic inference.
Features gate-driven automatic intra-card context parallelism for improved GPU SM utilization.
Employs hardware-friendly algebraic reformulation to minimize Tensor Core, CUDA Core, and SFU overhead.
Uses TileLang for fused warp-specialized kernels with manual warpgroup specialization.
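
One way to build intuition for the gate-driven context parallelism listed above: because each gate value scales the running state by a factor in (0, 1], the influence of old tokens shrinks geometrically, so a long sequence could in principle be cut into segments that each replay only a short warm-up window. The sketch below illustrates that decay argument only. It is not FlashQLA's scheduling logic, the function names and the tolerance parameter are hypothetical, and it considers the scalar gate alone while ignoring the rest of the state update.

```python
import math


def warmup_length(log_alphas, start, tol):
    """Tokens to replay before `start` so older state is scaled by <= tol.

    log_alphas[t] is log(alpha_t) for the gate at token t (each <= 0).
    Returning `start` means the whole prefix is replayed, which is exact.
    """
    budget = math.log(tol)
    acc = 0.0
    for n, t in enumerate(range(start - 1, -1, -1), start=1):
        acc += log_alphas[t]          # cumulative decay in log space
        if acc <= budget:
            return n
    return start


def plan_segments(log_alphas, num_segments, tol):
    """Split the sequence into segments with a per-segment warm-up start."""
    total = len(log_alphas)
    seg_len = math.ceil(total / num_segments)
    plan = []
    for seg_start in range(0, total, seg_len):
        warm = warmup_length(log_alphas, seg_start, tol)
        plan.append((seg_start - warm, seg_start, min(seg_start + seg_len, total)))
    return plan


if __name__ == "__main__":
    gates = [math.log(0.5)] * 64      # a constant gate of 0.5, for illustration
    for warm_start, seg_start, seg_end in plan_segments(gates, 4, tol=1e-3):
        print(f"segment [{seg_start}, {seg_end}) replays from token {warm_start}")
```

With a constant gate of 0.5 and a tolerance of 1e-3, ten warm-up tokens suffice because 0.5 to the power of 10 is below 1e-3. Gates close to 1 decay slowly and push the warm-up toward the full prefix, which is why a scheme like this would have to be driven by the actual gate values.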

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README. The project is associated with QwenLM.

Licensing & Compatibility

FlashQLA is released under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library has strict hardware and software requirements: an NVIDIA SM90 or newer GPU, CUDA 12.8 or above, and PyTorch 2.8 or above. These prerequisites rule out older hardware and may limit adoption in other development environments.
