Efficient Native Sparse Attention implementations
This repository provides Flash Sparse Attention (FSA), an optimized implementation of Native Sparse Attention (NSA) designed to improve the efficiency of Large Language Models (LLMs) on modern GPUs. It targets researchers and engineers working with LLMs who need to accelerate attention mechanisms, particularly for models with smaller Grouped Query Attention (GQA) group sizes. FSA offers significant speedups by reducing memory access and computation through a novel kernel design.
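For readers unfamiliar with the term, the GQA group size is simply the ratio of query heads to key/value heads, i.e. how many query heads share a single KV head. The snippet below illustrates the mapping; the head counts are arbitrary example values, not taken from the repository or any particular model.

# GQA group size: how many query heads share one KV head.
# Head counts here are arbitrary illustration values.
num_query_heads = 32
num_kv_heads = 4
gqa_group_size = num_query_heads // num_kv_heads                    # -> 8
kv_head_for_query = [h // gqa_group_size for h in range(num_query_heads)]
print(gqa_group_size, kv_head_for_query[:10])                       # 8 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]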
How It Works
FSA optimizes the NSA selected-attention module by changing the kernel loop order: where NSA's kernel iterates over query tokens in its outer loop, FSA iterates over KV blocks. The computation is decoupled into three kernels: a main kernel that batches all query tokens attending to the same KV block and computes their partial attention, a reduction kernel that accumulates those partial results per query, and an online-softmax kernel that maintains the softmax statistics needed for the reduction. This layout removes the memory accesses and wasted computation that padding would otherwise introduce, and it avoids atomic additions when accumulating results across KV blocks, yielding substantial performance gains.
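As a conceptual illustration (written for this summary, not taken from the repository, and ignoring GPU-level details such as kernel launches and tiling), the NumPy sketch below contrasts the two loop orders for selected attention and checks that an FSA-style outer loop over KV blocks, combined with an online-softmax reduction, reproduces the NSA-style per-query result. Shapes, block size, and the selection pattern are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, D, B = 8, 4, 4                      # query tokens, head dim, KV block size
n_blocks = T // B
q = rng.standard_normal((T, D))
k = rng.standard_normal((T, D))
v = rng.standard_normal((T, D))
# Which KV blocks each query selects (arbitrary sparse pattern for the demo).
selected = [rng.choice(n_blocks, size=1 + t % n_blocks, replace=False) for t in range(T)]

def nsa_style(q, k, v, selected):
    # NSA-style: outer loop over query tokens; each query gathers its selected KV blocks.
    out = np.zeros_like(q)
    for t in range(T):
        idx = np.concatenate([np.arange(b * B, (b + 1) * B) for b in selected[t]])
        s = q[t] @ k[idx].T / np.sqrt(D)
        p = np.exp(s - s.max())
        out[t] = (p / p.sum()) @ v[idx]
    return out

def fsa_style(q, k, v, selected):
    # FSA-style: outer loop over KV blocks; the "main kernel" batches every query
    # that selected the block, and the partial sums plus softmax statistics are
    # combined afterwards (here in-place) by the reduction / online-softmax step.
    acc = np.zeros((T, D))             # running exp-weighted value sums
    m = np.full(T, -np.inf)            # running max logits per query
    l = np.zeros(T)                    # running softmax denominators per query
    for b in range(n_blocks):
        q_ids = [t for t in range(T) if b in selected[t]]
        if not q_ids:
            continue
        kb = k[b * B:(b + 1) * B]
        vb = v[b * B:(b + 1) * B]
        s = q[q_ids] @ kb.T / np.sqrt(D)                 # (len(q_ids), B)
        m_new = np.maximum(m[q_ids], s.max(axis=1))
        scale = np.exp(m[q_ids] - m_new)
        p = np.exp(s - m_new[:, None])
        acc[q_ids] = acc[q_ids] * scale[:, None] + p @ vb
        l[q_ids] = l[q_ids] * scale + p.sum(axis=1)
        m[q_ids] = m_new
    return acc / l[:, None]

assert np.allclose(nsa_style(q, k, v, selected), fsa_style(q, k, v, selected))

In the real kernels, the per-block partial results and softmax statistics produced by the main kernel are combined by the separate reduction and online-softmax kernels, which is what makes atomic additions across KV blocks unnecessary.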
Quick Start & Requirements
pip install -r requirements.txt
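Beyond installing the dependencies, integrating the module would look roughly like the sketch below. The import path, the class name FlashSparseAttention, and the constructor arguments are assumptions made for illustration rather than the repository's documented API, so the hypothetical calls are left commented out; consult the project README for the real interface.

# Hypothetical integration sketch -- names and signatures are assumptions,
# not the repository's documented interface.
import torch

batch, seqlen, num_q_heads, num_kv_heads, head_dim = 1, 4096, 64, 4, 128
hidden = torch.randn(batch, seqlen, num_q_heads * head_dim, dtype=torch.bfloat16)

# from fsa import FlashSparseAttention                    # assumed import path
# attn = FlashSparseAttention(num_q_heads=num_q_heads,    # assumed signature
#                             num_kv_heads=num_kv_heads,
#                             head_dim=head_dim).cuda()
# out = attn(hidden.cuda())                               # expected: same shape as hidden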
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Last updated: 2 weeks ago. Activity status: Inactive.