Block-Sparse-Attention by mit-han-lab

Efficient sparse attention kernels for LLMs

Created 11 months ago
300 stars

Top 88.8% on SourcePulse

View on GitHub
Project Summary

This library provides optimized sparse attention kernels for Large Language Models (LLMs) to reduce computational and memory bandwidth demands, enabling efficient processing of longer prompts. It targets researchers and engineers working with LLMs who need to scale inference and handle complex inputs.

How It Works

Block Sparse Attention is a modification of FlashAttention 2.4.2 that adds support for multiple attention patterns: dense, token-level streaming, block-level streaming, and block-sparse attention. A key feature is the ability to assign a different sparsity pattern to each attention head within the same model, enabling hybrid masking strategies that trade off speed and accuracy.
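To make the per-head assignment concrete, below is a minimal pure-PyTorch sketch of hybrid masking: each head receives a dense, streaming (attention sinks plus a sliding window), or block-sparse boolean mask, applied through standard scaled-dot-product attention. This is conceptual only; it does not call the library's fused CUDA kernels, and the function names, pattern labels, and default parameters are illustrative assumptions rather than the repository's API.

```python
# Conceptual illustration of hybrid per-head masking (requires PyTorch 2.0+ for
# scaled_dot_product_attention). All names and defaults here are hypothetical.
import torch
import torch.nn.functional as F

def make_head_mask(pattern, seq_len, block_size=64, sink=64, local=256, keep_ratio=0.25):
    """Boolean (seq_len, seq_len) causal attention mask for one head (True = attend)."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if pattern == "dense":
        return causal
    if pattern == "streaming":  # attention sinks + sliding local window
        idx = torch.arange(seq_len)
        window = (idx[:, None] - idx[None, :]) < local
        sinks = idx[None, :] < sink
        return causal & (window | sinks)
    if pattern == "block_sparse":  # keep a subset of key/value blocks per query block
        n_blocks = (seq_len + block_size - 1) // block_size
        keep = torch.rand(n_blocks, n_blocks) < keep_ratio
        keep |= torch.eye(n_blocks, dtype=torch.bool)  # always keep the diagonal blocks
        block_mask = keep.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
        return causal & block_mask[:seq_len, :seq_len]
    raise ValueError(f"unknown pattern: {pattern}")

def hybrid_attention(q, k, v, head_patterns):
    """q, k, v: (batch, heads, seq, head_dim); one pattern string per head."""
    seq_len = q.shape[2]
    masks = torch.stack([make_head_mask(p, seq_len) for p in head_patterns]).to(q.device)
    # masks broadcasts as (1, heads, seq, seq) over the batch dimension
    return F.scaled_dot_product_attention(q, k, v, attn_mask=masks.unsqueeze(0))

q = k = v = torch.randn(1, 4, 512, 64)
out = hybrid_attention(q, k, v, ["dense", "streaming", "block_sparse", "streaming"])
print(out.shape)  # torch.Size([1, 4, 512, 64])
```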

Quick Start & Requirements

  • Install the build prerequisites with pip install packaging ninja, then build and install the package with python setup.py install.
  • Requires CUDA 11.6+, PyTorch 1.12+, and Linux.
  • Supports fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
  • Supports head dimensions of 32, 64, and 128 (a quick environment check is sketched after this list).
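The requirements above can be sanity-checked from Python before building. The snippet below is a hedged sketch using only standard PyTorch introspection; it is not a script shipped with the repository, and the compute-capability threshold simply encodes the Ampere/Ada/Hopper note above.

```python
# Rough environment check against the listed requirements (illustrative only).
import torch

def check_environment():
    assert torch.cuda.is_available(), "A CUDA-capable GPU (CUDA 11.6+) is required"
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])  # rough version parse
    assert (major, minor) >= (1, 12), f"PyTorch 1.12+ required, found {torch.__version__}"
    cc_major, _ = torch.cuda.get_device_capability()
    # bf16 needs Ampere (SM80), Ada (SM89), or Hopper (SM90); fp16 works more broadly
    print("bf16 eligible (compute capability >= 8.0):", cc_major >= 8)
    print("supported head dimensions: 32, 64, 128")

check_environment()
```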

Highlighted Details

  • Supports both forward and backward passes for block-sparse and block-streaming attention.
  • Enables hybrid masking by assigning different patterns to different heads (e.g., dense and streaming).
  • Demonstrates speedups over dense FlashAttention 2.4.2 on A100 GPUs.
  • Includes correctness and performance tests for various configurations (a minimal version of such a check is sketched after this list).
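The sketch below shows the general shape of such a correctness check in plain PyTorch: build a causal block-sparse mask, compare an attention output against an explicit softmax reference, and run a backward pass to exercise gradients. The mask layout, tolerances, and use of scaled_dot_product_attention as the stand-in fast path are illustrative assumptions, not the repository's actual test code.

```python
# Illustrative correctness check: masked attention vs. an explicit reference,
# plus a backward pass. Not the repo's test harness.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, D, block = 1, 2, 256, 64, 64
q = torch.randn(B, H, L, D, requires_grad=True)
k = torch.randn(B, H, L, D, requires_grad=True)
v = torch.randn(B, H, L, D, requires_grad=True)

# Causal block-sparse mask: keep diagonal blocks plus the first block column.
n_blocks = L // block
keep = torch.eye(n_blocks, dtype=torch.bool)
keep[:, 0] = True
mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
mask &= torch.tril(torch.ones(L, L, dtype=torch.bool))

# Reference path: explicit softmax with masked-out scores set to -inf.
scores = (q @ k.transpose(-2, -1)) / D**0.5
ref = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

# "Fast" path stand-in: SDPA with the same boolean mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

assert torch.allclose(out, ref, atol=1e-4)
out.sum().backward()  # exercises the backward pass
assert q.grad is not None and k.grad is not None and v.grad is not None
```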

Maintenance & Community

The project is associated with MIT and NVIDIA researchers, including Song Han. It acknowledges inspiration from FlashAttention, Big Bird, ETC, StreamingLLM, DuoAttention, and MInference 1.0.

Licensing & Compatibility

The repository does not explicitly state a license. The project is built upon FlashAttention, which is Apache 2.0 licensed. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The token_streaming_attn_func currently does not support the backward pass. The project is based on a specific version of FlashAttention (2.4.2), and compatibility with newer versions is not guaranteed.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
1k stars · Top 0.2% on SourcePulse · Created 1 year ago · Updated 6 months ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab
Fast, memory-efficient attention implementation
20k stars · Top 0.6% on SourcePulse · Created 3 years ago · Updated 1 day ago