native-sparse-attention-triton by XunhaoLai

Efficient sparse attention for LLMs

Created 10 months ago
258 stars

Top 98.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides an efficient Triton implementation of the Native Sparse Attention mechanism, designed for both training and inference of large language models. It addresses the computational bottlenecks of standard attention by introducing hardware-aligned sparsity, offering potential speedups and memory savings for researchers and power users.

How It Works

The project uses Triton kernels for optimized sparse attention computation. It follows a variable-length (varlen) approach, similar to FlashAttention's varlen API, and supports prefilling, decoding, and KV cache management. The core pipeline is: linear_compress to compress keys/values, compressed_attention to attend over the compressed KV and select the top-k most relevant blocks per query, and topk_sparse_attention to compute attention only over the selected blocks, reducing computational cost.
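As a rough illustration of how those ops chain together, the sketch below packs queries, keys, and values in varlen form and runs compression, compressed attention, and top-k sparse attention in sequence. The import path (native_sparse_attention.ops), the weight shapes, and the exact parameter names and ordering are assumptions for illustration only; the repository's README is authoritative for the real signatures.

```python
import torch
# Assumed import path; verify against the repository's module layout.
from native_sparse_attention.ops import (
    linear_compress,
    compressed_attention,
    topk_sparse_attention,
)

# Packed (varlen) inputs: (total_tokens, num_heads, head_dim), bfloat16 on GPU.
cu_seqlens = torch.tensor([0, 1024, 3072], dtype=torch.int32, device="cuda")
query = torch.randn(3072, 32, 64, dtype=torch.bfloat16, device="cuda")
key = torch.randn(3072, 2, 64, dtype=torch.bfloat16, device="cuda")
value = torch.randn(3072, 2, 64, dtype=torch.bfloat16, device="cuda")

kernel_size, kernel_stride, block_size, topk = 32, 16, 64, 16

# 1. Compress keys/values with a learned linear projection over sliding windows.
#    These weight shapes are assumptions for the sketch.
w_k = torch.randn(2, kernel_size * 64, 64, dtype=torch.bfloat16, device="cuda")
w_v = torch.randn(2, kernel_size * 64, 64, dtype=torch.bfloat16, device="cuda")
compressed_key, compressed_cu_seqlens = linear_compress(
    key, w_k, cu_seqlens, kernel_size, kernel_stride, None
)
compressed_value, _ = linear_compress(
    value, w_v, cu_seqlens, kernel_size, kernel_stride, None
)

# 2. Attend over the compressed keys/values; this also yields the top-k block
#    indices each query should attend to.
compressed_out, topk_idx = compressed_attention(
    query, compressed_key, compressed_value,
    kernel_size, kernel_stride, block_size, topk,
    cu_seqlens, compressed_cu_seqlens,
)

# 3. Sparse attention restricted to the selected blocks of the original K/V.
sparse_out = topk_sparse_attention(
    query, key, value, topk_idx, block_size, cu_seqlens
)
```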

Quick Start & Requirements

  • Install: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git
  • Dependencies: PyTorch >= 2.1.0, triton >= 3.0.0, einops >= 0.7.0, flash_attn >= 2.6.3.
  • Hardware: Requires NVIDIA GPU with CUDA support for Triton kernels. Benchmarks are provided for A100 and H100 GPUs.
  • Usage: Batch inputs must be concatenated (packed) before use because of the varlen approach; see the packing sketch after this list.
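Because of that varlen convention, a batch is packed into one long token dimension with a cumulative-length tensor, in the style of FlashAttention's varlen API. The sketch below uses plain PyTorch and hypothetical shapes to show the packing step only.

```python
import torch

# Three hypothetical sequences, each (seq_len, num_heads, head_dim).
seqs = [
    torch.randn(n, 4, 64, dtype=torch.bfloat16, device="cuda")
    for n in (512, 1024, 2048)
]

# Concatenate along the token dimension: (total_tokens, num_heads, head_dim).
packed = torch.cat(seqs, dim=0)

# cu_seqlens holds cumulative sequence boundaries, e.g. [0, 512, 1536, 3584],
# which varlen-style kernels expect alongside the packed tensor.
lengths = torch.tensor([s.shape[0] for s in seqs], device="cuda")
cu_seqlens = torch.cat([
    torch.zeros(1, dtype=torch.int32, device="cuda"),
    torch.cumsum(lengths, dim=0).to(torch.int32),
])
```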

Highlighted Details

  • Supports end-to-end training and inference for Native Sparse Attention.
  • Offers both low-level ops functions and higher-level nn.Module implementations (see the module-level sketch after this list).
  • Includes simplified LLaMA models (ToyNSALlama) for integration examples.
  • Extensive benchmarks demonstrate significant speed advantages over FlashAttention, particularly at larger sequence lengths on A100/H100 hardware.
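For the module-level path, usage would look roughly like the following. The import path, class name, and constructor arguments shown here are assumptions for illustration; check the repository's nn.Module and ToyNSALlama examples for the actual API.

```python
import torch
# Assumed import path and constructor arguments; verify against the repository.
from native_sparse_attention.module import NativeSparseAttention

nsa_layer = NativeSparseAttention(
    hidden_size=2048,
    num_q_heads=32,
    num_kv_heads=2,
    head_dim=64,
    kernel_size=32,
    kernel_stride=16,
    block_size=64,
    topk=16,
).cuda().to(torch.bfloat16)

# Varlen input: packed hidden states plus cumulative sequence lengths.
cu_seqlens = torch.tensor([0, 1024, 3072], dtype=torch.int32, device="cuda")
hidden_states = torch.randn(3072, 2048, dtype=torch.bfloat16, device="cuda")
output = nsa_layer(hidden_states, cu_seqlens)
```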

Maintenance & Community

  • Contributions are welcome; proposed changes should first be discussed via GitHub issues.
  • Direct contact for questions/feedback is available via laixunhao@pku.edu.cn.
  • No specific community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

  • The repository README does not specify a software license. This absence requires clarification for adoption decisions, especially regarding commercial use or derivative works.

Limitations & Caveats

  • Currently limited to attention head dimensions less than 128.
  • PyTorch operator implementations are designated for debugging only; Triton operators are recommended for production.
  • Requires pre-concatenation of batch inputs.
  • Absence of explicit licensing information.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

DeepSeek-V3.2-Exp by deepseek-ai

1.0%
1k
Experimental LLM boosting long-context efficiency
Created 3 months ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.3%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai

3.5%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago