Flash-Sparse-Attention by Relaxed-System-Lab

Efficient Native Sparse Attention implementations

Created 1 month ago
948 stars

Top 38.7% on SourcePulse

View on GitHub
Project Summary

This repository provides Flash Sparse Attention (FSA), an optimized implementation of Native Sparse Attention (NSA) designed to improve the efficiency of Large Language Models (LLMs) on modern GPUs. It targets researchers and engineers working with LLMs who need to accelerate attention mechanisms, particularly for models with smaller Grouped Query Attention (GQA) group sizes. FSA offers significant speedups by reducing memory access and computation through a novel kernel design.

How It Works

FSA optimizes the NSA selected-attention module by changing the kernel loop order. Whereas NSA loops over query tokens in the outer loop, FSA loops over KV blocks in the outer loop. The computation is decoupled into three kernels: a main kernel that batches the query tokens attending to the same KV block, a reduction kernel that accumulates partial attention results across KV blocks, and an online softmax kernel that computes the softmax statistics needed for that accumulation. This design avoids unnecessary memory accesses and padding-related computation, and eliminates atomic additions when accumulating results across KV blocks, yielding substantial performance gains.
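
To make the loop-order change concrete, the sketch below is a plain PyTorch rendering, not the repository's Triton kernels, of FSA-style selected attention: the outer loop runs over KV blocks, each iteration batches the query tokens that selected that block, and partial results are merged via online-softmax statistics rather than atomic additions. Tensor shapes and the boolean selection layout are illustrative assumptions.

    import torch

    def fsa_style_selected_attention(q, k, v, selection, block_size):
        # q, k, v: [T, d] (fp32 assumed for simplicity); selection: bool [T, num_kv_blocks],
        # True where a query token attends to a KV block. Conceptual sketch only.
        T, d = q.shape
        num_blocks = selection.shape[1]
        scale = d ** -0.5
        m = torch.full((T,), float("-inf"), device=q.device)  # running max (online-softmax statistic)
        l = torch.zeros(T, device=q.device)                   # running normalizer
        o = torch.zeros(T, d, device=q.device)                # running (unnormalized) output
        for b in range(num_blocks):                           # outer loop over KV blocks (FSA order)
            idx = selection[:, b].nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            kb = k[b * block_size:(b + 1) * block_size]
            vb = v[b * block_size:(b + 1) * block_size]
            # "Main kernel": all queries that selected this block are batched together.
            s = (q[idx] @ kb.T) * scale
            m_new = torch.maximum(m[idx], s.max(dim=-1).values)
            p = torch.exp(s - m_new[:, None])
            alpha = torch.exp(m[idx] - m_new)
            # "Reduction": rescale and accumulate partial results; no atomic additions needed.
            l[idx] = l[idx] * alpha + p.sum(dim=-1)
            o[idx] = o[idx] * alpha[:, None] + p @ vb
            m[idx] = m_new
        return o / l.clamp_min(1e-20)[:, None]

In the repository these steps are split into three separate Triton kernels; the sequential Python loop above only illustrates the loop ordering and the online-softmax merge.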

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: PyTorch >= 2.4, Triton >= 3.0, transformers >= 4.45.0, datasets >= 3.3.0, accelerate >= 1.9.0, flash-attn == 2.6.3.
  • Hardware: NVIDIA Ampere or Hopper GPUs (e.g., A100, H100), fp16/bf16 datatypes (a quick check is sketched after this list).
  • Setup: Requires specific library versions and compatible NVIDIA hardware.
  • Docs: The README provides usage examples and benchmarking scripts.
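
As a quick sanity check before installing, a short snippet like the one below (not part of the repository) can confirm the hardware prerequisites listed above, i.e. an Ampere- or Hopper-class GPU with fp16/bf16 support:

    import torch

    assert torch.cuda.is_available(), "FSA's Triton kernels require an NVIDIA GPU"
    major, minor = torch.cuda.get_device_capability()
    assert (major, minor) >= (8, 0), "expected Ampere (e.g. A100) or Hopper (e.g. H100)"
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    print(f"compute capability {major}.{minor}, usable low-precision dtype: {dtype}")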

Highlighted Details

  • Offers an efficient Triton-based implementation for NSA, particularly effective for GQA group sizes smaller than 8.
  • Falls back to the original NSA implementation for GQA group sizes >= 8, where that path performs better (see the dispatch sketch after this list).
  • Tested with various NVIDIA GPUs, datatypes (fp16, bf16), and sequence lengths.
  • Provides optimized kernels that accelerate the NSA selected attention module without altering the NSA algorithm itself.
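
A hypothetical dispatch rule (the repository's actual API may differ) captures the fallback described above: the FSA path is used when the GQA group size, i.e. the number of query heads sharing one KV head, is below 8, and the original NSA path otherwise.

    def select_attention_path(num_q_heads: int, num_kv_heads: int) -> str:
        # GQA group size = query heads per KV head; the function name is illustrative.
        gqa_group_size = num_q_heads // num_kv_heads
        return "FSA selected attention" if gqa_group_size < 8 else "original NSA selected attention"

    assert select_attention_path(num_q_heads=28, num_kv_heads=4) == "FSA selected attention"           # group size 7
    assert select_attention_path(num_q_heads=32, num_kv_heads=4) == "original NSA selected attention"  # group size 8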

Maintenance & Community

  • The project is relatively new; both its arXiv paper and the open-source release date to August 2025.
  • Upcoming features include an online profiling module.
  • A beta version of one-step decoding is available in fsa_preview.

Licensing & Compatibility

  • The code is publicly available, but the README does not state a specific license; suitability for commercial use depends on the license that is eventually applied.

Limitations & Caveats

  • The implementation is primarily tested on NVIDIA Ampere or Hopper GPUs and may not perform optimally on other architectures.
  • For GQA group sizes of 8 or larger, the implementation falls back to the original NSA kernels, so FSA's speedup does not extend to those configurations.
  • The project is in its early stages, with features still under active development.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 3

Star History

  • 892 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml (2k stars, 1.3%)
Attention kernel for plug-and-play inference acceleration
Created 11 months ago, updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin (6k stars, 0.6%)
Triton kernels for efficient LLM training
Created 1 year ago, updated 1 day ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode (5k stars, 0.8%)
Lecture series for GPU-accelerated computing
Created 1 year ago, updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab (20k stars, 0.6%)
Fast, memory-efficient attention implementation
Created 3 years ago, updated 1 day ago