Efficient sparse attention kernels for LLMs
Top 88.8% on SourcePulse
This library provides optimized sparse attention kernels for Large Language Models (LLMs) to reduce computational and memory bandwidth demands, enabling efficient processing of longer prompts. It targets researchers and engineers working with LLMs who need to scale inference and handle complex inputs.
How It Works
Block Sparse Attention is a modification of FlashAttention 2.4.2, introducing support for various sparse attention patterns including dense, token-level streaming, block-level streaming, and block-sparse attention. A key innovation is the ability to assign different sparsity patterns to different attention heads within the same model, allowing for hybrid masking strategies that balance performance and accuracy.
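To make the per-head hybrid masking idea concrete, here is a minimal, self-contained PyTorch sketch. It is a plain reference implementation for illustration only, not the library's fused kernels or API; the block size, mask layouts, and function names are assumptions chosen for readability.

```python
# Illustration only: per-head hybrid sparsity in a reference (non-kernel) form.
import torch

def build_head_block_masks(num_heads, num_blocks, local_blocks=2, sink_blocks=1):
    """One [num_blocks, num_blocks] boolean mask per head:
    head 0 -> dense, head 1 -> block-level streaming (sink + local window),
    remaining heads -> a fixed block-sparse pattern."""
    masks = []
    for h in range(num_heads):
        if h == 0:  # dense head: attend to every block
            m = torch.ones(num_blocks, num_blocks, dtype=torch.bool)
        elif h == 1:  # streaming head: sink blocks + local diagonal band
            m = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
            m[:, :sink_blocks] = True
            for i in range(num_blocks):
                m[i, max(0, i - local_blocks + 1): i + 1] = True
        else:  # block-sparse head: keep every other column block + diagonal
            m = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
            m[:, ::2] = True
            m.fill_diagonal_(True)
        masks.append(torch.tril(m))  # keep causal structure
    return torch.stack(masks)  # [num_heads, num_blocks, num_blocks]

def reference_block_sparse_attention(q, k, v, block_masks, block_size):
    """q, k, v: [batch, heads, seq, dim]; expands block masks to token level
    and runs plain softmax attention, masking out disallowed scores."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    token_mask = block_masks.repeat_interleave(block_size, dim=-2) \
                            .repeat_interleave(block_size, dim=-1)
    scores = scores.masked_fill(~token_mask.unsqueeze(0), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

batch, heads, seq, dim, block = 1, 4, 256, 64, 64
q, k, v = (torch.randn(batch, heads, seq, dim) for _ in range(3))
masks = build_head_block_masks(heads, seq // block)
out = reference_block_sparse_attention(q, k, v, masks, block)
print(out.shape)  # torch.Size([1, 4, 256, 64])
```

The real kernels achieve their speedup by skipping masked blocks entirely rather than materializing a full score matrix as this reference version does.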
Quick Start & Requirements
Install the build dependencies with pip install packaging ninja, then compile and install the kernels with python setup.py install.
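Because the kernels build as a CUDA extension against PyTorch (inherited from FlashAttention), a quick environment check before running setup.py can save a failed compile. A minimal sketch, assuming a CUDA-enabled PyTorch install is required:

```python
# Sanity-check the build environment before compiling the CUDA extension.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA toolkit seen by PyTorch:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```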
Highlighted Details
Maintenance & Community
The project is associated with researchers from MIT and NVIDIA, including Song Han. It acknowledges inspiration from FlashAttention, Big Bird, ETC, StreamingLLM, Duo Attention, and MInference 1.0.
Licensing & Compatibility
The repository does not explicitly state a license. The project is built upon FlashAttention, which is Apache 2.0 licensed. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The token_streaming_attn_func currently does not support the backward pass. The project is based on a specific version of FlashAttention (2.4.2), and compatibility with newer versions is not guaranteed.
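Since that kernel is forward-only, inference code can make the constraint explicit by disabling autograd around the call. A minimal sketch; the import path and call signature shown are assumptions and should be checked against the repository:

```python
import torch

# Hypothetical import -- verify the actual module path in the repository:
# from block_sparse_attn import token_streaming_attn_func

def run_streaming_attention(attn_func, q, k, v):
    # The streaming kernel has no backward pass, so ensure no autograd
    # graph is built around it (inference-only usage).
    with torch.inference_mode():
        return attn_func(q, k, v)
```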
The last commit was about 7 months ago, and the repository appears inactive.