native-sparse-attention-triton by XunhaoLai

Efficient sparse attention for LLMs

Created 10 months ago
258 stars

Top 98.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides an efficient Triton implementation of the Native Sparse Attention mechanism, designed for both training and inference of large language models. It addresses the computational bottlenecks of standard attention by introducing hardware-aligned sparsity, offering potential speedups and memory savings for researchers and power users.

How It Works

The project uses Triton kernels for optimized sparse attention computation. It follows a variable-length (varlen) approach, similar to FlashAttention's varlen API, and supports prefilling, decoding, and KV cache management. The core pipeline is: linear_compress to compress keys/values, compressed_attention to attend over the compressed KV and select the top-k most relevant blocks per query, and topk_sparse_attention to compute attention only over the selected blocks, reducing computational cost.
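As a rough illustration of how those ops chain together, the sketch below packs queries, keys, and values in varlen form and runs compression, compressed attention, and top-k sparse attention in sequence. The import path (native_sparse_attention.ops), the weight shapes, and the exact parameter names and ordering are assumptions for illustration only; the repository's README is authoritative for the real signatures.

```python
import torch
# Assumed import path; verify against the repository's module layout.
from native_sparse_attention.ops import (
    linear_compress,
    compressed_attention,
    topk_sparse_attention,
)

# Packed (varlen) inputs: (total_tokens, num_heads, head_dim), bfloat16 on GPU.
cu_seqlens = torch.tensor([0, 1024, 3072], dtype=torch.int32, device="cuda")
query = torch.randn(3072, 32, 64, dtype=torch.bfloat16, device="cuda")
key = torch.randn(3072, 2, 64, dtype=torch.bfloat16, device="cuda")
value = torch.randn(3072, 2, 64, dtype=torch.bfloat16, device="cuda")

kernel_size, kernel_stride, block_size, topk = 32, 16, 64, 16

# 1. Compress keys/values with a learned linear projection over sliding windows.
#    These weight shapes are assumptions for the sketch.
w_k = torch.randn(2, kernel_size * 64, 64, dtype=torch.bfloat16, device="cuda")
w_v = torch.randn(2, kernel_size * 64, 64, dtype=torch.bfloat16, device="cuda")
compressed_key, compressed_cu_seqlens = linear_compress(
    key, w_k, cu_seqlens, kernel_size, kernel_stride, None
)
compressed_value, _ = linear_compress(
    value, w_v, cu_seqlens, kernel_size, kernel_stride, None
)

# 2. Attend over the compressed keys/values; this also yields the top-k block
#    indices each query should attend to.
compressed_out, topk_idx = compressed_attention(
    query, compressed_key, compressed_value,
    kernel_size, kernel_stride, block_size, topk,
    cu_seqlens, compressed_cu_seqlens,
)

# 3. Sparse attention restricted to the selected blocks of the original K/V.
sparse_out = topk_sparse_attention(
    query, key, value, topk_idx, block_size, cu_seqlens
)
```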

Quick Start & Requirements

  • Install: pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git
  • Dependencies: PyTorch >= 2.1.0, triton >= 3.0.0, einops >= 0.7.0, flash_attn >= 2.6.3.
  • Hardware: Requires NVIDIA GPU with CUDA support for Triton kernels. Benchmarks are provided for A100 and H100 GPUs.
  • Usage: Batch inputs must be concatenated (packed) before use because of the varlen approach; see the packing sketch after this list.
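Because of that varlen convention, a batch is packed into one long token dimension with a cumulative-length tensor, in the style of FlashAttention's varlen API. The sketch below uses plain PyTorch and hypothetical shapes to show the packing step only.

```python
import torch

# Three hypothetical sequences, each (seq_len, num_heads, head_dim).
seqs = [
    torch.randn(n, 4, 64, dtype=torch.bfloat16, device="cuda")
    for n in (512, 1024, 2048)
]

# Concatenate along the token dimension: (total_tokens, num_heads, head_dim).
packed = torch.cat(seqs, dim=0)

# cu_seqlens holds cumulative sequence boundaries, e.g. [0, 512, 1536, 3584],
# which varlen-style kernels expect alongside the packed tensor.
lengths = torch.tensor([s.shape[0] for s in seqs], device="cuda")
cu_seqlens = torch.cat([
    torch.zeros(1, dtype=torch.int32, device="cuda"),
    torch.cumsum(lengths, dim=0).to(torch.int32),
])
```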

Highlighted Details

  • Supports end-to-end training and inference for Native Sparse Attention.
  • Offers both low-level ops functions and higher-level nn.Module implementations (see the module-level sketch after this list).
  • Includes simplified LLaMA models (ToyNSALlama) for integration examples.
  • Extensive benchmarks demonstrate significant speed advantages over FlashAttention, particularly at larger sequence lengths on A100/H100 hardware.
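For the module-level path, usage would look roughly like the following. The import path, class name, and constructor arguments shown here are assumptions for illustration; check the repository's nn.Module and ToyNSALlama examples for the actual API.

```python
import torch
# Assumed import path and constructor arguments; verify against the repository.
from native_sparse_attention.module import NativeSparseAttention

nsa_layer = NativeSparseAttention(
    hidden_size=2048,
    num_q_heads=32,
    num_kv_heads=2,
    head_dim=64,
    kernel_size=32,
    kernel_stride=16,
    block_size=64,
    topk=16,
).cuda().to(torch.bfloat16)

# Varlen input: packed hidden states plus cumulative sequence lengths.
cu_seqlens = torch.tensor([0, 1024, 3072], dtype=torch.int32, device="cuda")
hidden_states = torch.randn(3072, 2048, dtype=torch.bfloat16, device="cuda")
output = nsa_layer(hidden_states, cu_seqlens)
```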

Maintenance & Community

  • Contributions are welcome; proposed changes should first be discussed via GitHub issues.
  • Direct contact for questions/feedback is available via laixunhao@pku.edu.cn.
  • No specific community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

  • The repository README does not specify a software license. This absence requires clarification for adoption decisions, especially regarding commercial use or derivative works.

Limitations & Caveats

  • Currently limited to attention head dimensions less than 128.
  • PyTorch operator implementations are designated for debugging only; Triton operators are recommended for production.
  • Requires pre-concatenation of batch inputs.
  • Absence of explicit licensing information.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

DeepSeek-V3.2-Exp by deepseek-ai

1.0%
1k
Experimental LLM boosting long-context efficiency
Created 3 months ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

1.3%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai

3.5%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago