Flash-Sparse-Attention by Relaxed-System-Lab

Efficient Native Sparse Attention implementations

Created 1 month ago
948 stars

Top 38.7% on SourcePulse

View on GitHub
Project Summary

This repository provides Flash Sparse Attention (FSA), an optimized implementation of Native Sparse Attention (NSA) designed to improve the efficiency of Large Language Models (LLMs) on modern GPUs. It targets researchers and engineers working with LLMs who need to accelerate attention mechanisms, particularly for models with smaller Grouped Query Attention (GQA) group sizes. FSA offers significant speedups by reducing memory access and computation through a novel kernel design.

How It Works

FSA optimizes the NSA selected-attention module by changing the kernel loop order. Whereas NSA loops over query tokens in the outer loop, FSA loops over KV blocks in the outer loop. The computation is decoupled into three kernels: a main kernel that batches the query tokens attending to the same KV block, a reduction kernel that accumulates partial attention results across KV blocks, and an online softmax kernel that computes the softmax statistics needed for that accumulation. This design avoids unnecessary memory accesses and padding-related computation, and eliminates atomic additions when accumulating results across KV blocks, yielding substantial performance gains.
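
To make the loop-order change concrete, the sketch below is a plain PyTorch rendering, not the repository's Triton kernels, of FSA-style selected attention: the outer loop runs over KV blocks, each iteration batches the query tokens that selected that block, and partial results are merged via online-softmax statistics rather than atomic additions. Tensor shapes and the boolean selection layout are illustrative assumptions.

    import torch

    def fsa_style_selected_attention(q, k, v, selection, block_size):
        # q, k, v: [T, d] (fp32 assumed for simplicity); selection: bool [T, num_kv_blocks],
        # True where a query token attends to a KV block. Conceptual sketch only.
        T, d = q.shape
        num_blocks = selection.shape[1]
        scale = d ** -0.5
        m = torch.full((T,), float("-inf"), device=q.device)  # running max (online-softmax statistic)
        l = torch.zeros(T, device=q.device)                   # running normalizer
        o = torch.zeros(T, d, device=q.device)                # running (unnormalized) output
        for b in range(num_blocks):                           # outer loop over KV blocks (FSA order)
            idx = selection[:, b].nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            kb = k[b * block_size:(b + 1) * block_size]
            vb = v[b * block_size:(b + 1) * block_size]
            # "Main kernel": all queries that selected this block are batched together.
            s = (q[idx] @ kb.T) * scale
            m_new = torch.maximum(m[idx], s.max(dim=-1).values)
            p = torch.exp(s - m_new[:, None])
            alpha = torch.exp(m[idx] - m_new)
            # "Reduction": rescale and accumulate partial results; no atomic additions needed.
            l[idx] = l[idx] * alpha + p.sum(dim=-1)
            o[idx] = o[idx] * alpha[:, None] + p @ vb
            m[idx] = m_new
        return o / l.clamp_min(1e-20)[:, None]

In the repository these steps are split into three separate Triton kernels; the sequential Python loop above only illustrates the loop ordering and the online-softmax merge.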

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: PyTorch >= 2.4, Triton >= 3.0, transformers >= 4.45.0, datasets >= 3.3.0, accelerate >= 1.9.0, flash-attn == 2.6.3.
  • Hardware: NVIDIA Ampere or Hopper GPUs (e.g., A100, H100), fp16/bf16 datatypes (a quick check is sketched after this list).
  • Setup: Requires specific library versions and compatible NVIDIA hardware.
  • Docs: The README provides usage examples and benchmarking scripts.
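
As a quick sanity check before installing, a short snippet like the one below (not part of the repository) can confirm the hardware prerequisites listed above, i.e. an Ampere- or Hopper-class GPU with fp16/bf16 support:

    import torch

    assert torch.cuda.is_available(), "FSA's Triton kernels require an NVIDIA GPU"
    major, minor = torch.cuda.get_device_capability()
    assert (major, minor) >= (8, 0), "expected Ampere (e.g. A100) or Hopper (e.g. H100)"
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    print(f"compute capability {major}.{minor}, usable low-precision dtype: {dtype}")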

Highlighted Details

  • Offers an efficient Triton-based implementation for NSA, particularly effective for GQA group sizes smaller than 8.
  • Falls back to the original NSA implementation for GQA group sizes >= 8, where that path performs better (see the dispatch sketch after this list).
  • Tested with various NVIDIA GPUs, datatypes (fp16, bf16), and sequence lengths.
  • Provides optimized kernels that accelerate the NSA selected attention module without altering the NSA algorithm itself.
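
A hypothetical dispatch rule (the repository's actual API may differ) captures the fallback described above: the FSA path is used when the GQA group size, i.e. the number of query heads sharing one KV head, is below 8, and the original NSA path otherwise.

    def select_attention_path(num_q_heads: int, num_kv_heads: int) -> str:
        # GQA group size = query heads per KV head; the function name is illustrative.
        gqa_group_size = num_q_heads // num_kv_heads
        return "FSA selected attention" if gqa_group_size < 8 else "original NSA selected attention"

    assert select_attention_path(num_q_heads=28, num_kv_heads=4) == "FSA selected attention"           # group size 7
    assert select_attention_path(num_q_heads=32, num_kv_heads=4) == "original NSA selected attention"  # group size 8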

Maintenance & Community

  • The project is relatively new; both its arXiv paper and the open-source release date to August 2025.
  • Upcoming features include an online profiling module.
  • A beta version of one-step decoding is available in fsa_preview.

Licensing & Compatibility

  • The code is publicly available, but the README does not state a specific license; suitability for commercial use depends on the license that is eventually applied.

Limitations & Caveats

  • The implementation is primarily tested on NVIDIA Ampere or Hopper GPUs and may not perform optimally on other architectures.
  • For GQA group sizes of 8 or larger, the implementation falls back to the original NSA kernels, so FSA's speedup does not extend to those configurations.
  • The project is in its early stages, with features still under active development.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 3

Star History

  • 892 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml (2k stars, 1.3%)
Attention kernel for plug-and-play inference acceleration
Created 11 months ago, updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin (6k stars, 0.6%)
Triton kernels for efficient LLM training
Created 1 year ago, updated 1 day ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode (5k stars, 0.8%)
Lecture series for GPU-accelerated computing
Created 1 year ago, updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab (20k stars, 0.6%)
Fast, memory-efficient attention implementation
Created 3 years ago, updated 1 day ago