XunhaoLai / native-sparse-attention-triton: Efficient sparse attention for LLMs
Top 98.1% on SourcePulse
This repository provides an efficient Triton implementation of the Native Sparse Attention mechanism, designed for both training and inference of large language models. It addresses the computational bottlenecks of standard attention by introducing hardware-aligned sparsity, offering potential speedups and memory savings for researchers and power users.
How It Works
The project leverages Triton kernels for optimized sparse attention computation. It implements a variable-length approach supporting prefilling, decoding, and KV cache management, similar to FlashAttention's varlen API. Key operations include linear_compress for key/value compression, and compressed_attention followed by topk_sparse_attention to score key/value blocks and attend only to the highest-scoring ones, reducing computational complexity.
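A minimal dense-PyTorch sketch of that compress, score, and select pipeline is shown below. The function names mirror the roles of linear_compress, compressed_attention, and topk_sparse_attention, but the shapes, signatures, and block-gathering logic are illustrative assumptions; the project itself implements these stages as fused Triton kernels over varlen inputs.

```python
# A dense-PyTorch illustration of the compress -> score -> select pipeline.
# Function names mirror the roles of linear_compress / compressed_attention /
# topk_sparse_attention, but shapes and signatures are illustrative assumptions;
# the project implements these stages as fused Triton kernels with a varlen API.
import torch
import torch.nn.functional as F

def compress_blocks(x, proj, block_size):
    # Compress every block of `block_size` tokens into a single token via a
    # learned linear projection: (batch, seq, dim) -> (batch, n_blocks, dim).
    b, s, d = x.shape
    n_blocks = s // block_size
    blocks = x[:, : n_blocks * block_size].reshape(b, n_blocks, block_size * d)
    return blocks @ proj  # proj: (block_size * dim, dim)

def topk_block_sparse_attention(q, k, v, k_cmp, block_size, top_k):
    b, s_q, d = q.shape
    # 1) Score queries against the compressed keys to rank key/value blocks.
    scores = (q @ k_cmp.transpose(-1, -2)) / d ** 0.5          # (b, s_q, n_blocks)
    topk_idx = scores.topk(top_k, dim=-1).indices               # (b, s_q, top_k)
    # 2) Gather only the selected full-resolution blocks and attend to them.
    k_blocks = k.reshape(b, -1, block_size, d)
    v_blocks = v.reshape(b, -1, block_size, d)
    batch_idx = torch.arange(b)[:, None, None]                  # (b, 1, 1)
    k_sel = k_blocks[batch_idx, topk_idx].reshape(b, s_q, top_k * block_size, d)
    v_sel = v_blocks[batch_idx, topk_idx].reshape(b, s_q, top_k * block_size, d)
    attn = F.softmax((q.unsqueeze(2) @ k_sel.transpose(-1, -2)) / d ** 0.5, dim=-1)
    return (attn @ v_sel).squeeze(2)                            # (b, s_q, d)

# Tiny smoke test: 256 tokens, 64-dim heads, 32-token blocks, top-4 blocks per query.
b, s, d, bs, top_k = 1, 256, 64, 32, 4
q, k, v = (torch.randn(b, s, d) for _ in range(3))
proj = torch.randn(bs * d, d) / (bs * d) ** 0.5
k_cmp = compress_blocks(k, proj, bs)
out = topk_block_sparse_attention(q, k, v, k_cmp, bs, top_k)
print(out.shape)  # torch.Size([1, 256, 64])
```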
Quick Start & Requirements
pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git
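After installation, inputs follow the FlashAttention-style variable-length ("varlen") layout mentioned above: sequences of different lengths are packed into a single tensor and described by cumulative offsets. The sketch below shows one common way to build such a packed batch; pack_varlen and its return values are hypothetical helpers for illustration, not part of the library's API.

```python
# Illustration of the FlashAttention-style "varlen" input layout: sequences of
# different lengths packed into one tensor plus cumulative offsets (cu_seqlens).
# pack_varlen is a hypothetical helper for illustration, not the library's API.
import torch

def pack_varlen(seqs):
    # seqs: list of (seq_len_i, num_heads, head_dim) tensors with varying lengths.
    lens = torch.tensor([s.shape[0] for s in seqs], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(seqs) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lens, dim=0)
    packed = torch.cat(seqs, dim=0)           # (total_tokens, num_heads, head_dim)
    return packed, cu_seqlens, int(lens.max())

# Three sequences of lengths 128, 517, and 33 with 4 heads of dimension 64.
seqs = [torch.randn(n, 4, 64) for n in (128, 517, 33)]
packed_q, cu_seqlens, max_seqlen = pack_varlen(seqs)
print(packed_q.shape, cu_seqlens.tolist(), max_seqlen)
# torch.Size([678, 4, 64]) [0, 128, 645, 678] 517
```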
Highlighted Details
Provides both low-level ops functions and higher-level nn.Module implementations.
Includes a toy model (ToyNSALlama) for integration examples.
Maintenance & Community
Last updated 7 months ago; the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats