Block-Sparse-Attention by mit-han-lab

Efficient sparse attention kernels for LLMs

Created 11 months ago
300 stars

Top 88.8% on SourcePulse

View on GitHub
Project Summary

This library provides optimized sparse attention kernels for Large Language Models (LLMs) to reduce computational and memory bandwidth demands, enabling efficient processing of longer prompts. It targets researchers and engineers working with LLMs who need to scale inference and handle complex inputs.

How It Works

Block Sparse Attention is a modification of FlashAttention 2.4.2 that adds support for multiple attention patterns: dense, token-level streaming, block-level streaming, and block-sparse attention. A key feature is the ability to assign a different sparsity pattern to each attention head within the same model, enabling hybrid masking strategies that trade off speed and accuracy.
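To make the per-head assignment concrete, below is a minimal pure-PyTorch sketch of hybrid masking: each head receives a dense, streaming (attention sinks plus a sliding window), or block-sparse boolean mask, applied through standard scaled-dot-product attention. This is conceptual only; it does not call the library's fused CUDA kernels, and the function names, pattern labels, and default parameters are illustrative assumptions rather than the repository's API.

```python
# Conceptual illustration of hybrid per-head masking (requires PyTorch 2.0+ for
# scaled_dot_product_attention). All names and defaults here are hypothetical.
import torch
import torch.nn.functional as F

def make_head_mask(pattern, seq_len, block_size=64, sink=64, local=256, keep_ratio=0.25):
    """Boolean (seq_len, seq_len) causal attention mask for one head (True = attend)."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if pattern == "dense":
        return causal
    if pattern == "streaming":  # attention sinks + sliding local window
        idx = torch.arange(seq_len)
        window = (idx[:, None] - idx[None, :]) < local
        sinks = idx[None, :] < sink
        return causal & (window | sinks)
    if pattern == "block_sparse":  # keep a subset of key/value blocks per query block
        n_blocks = (seq_len + block_size - 1) // block_size
        keep = torch.rand(n_blocks, n_blocks) < keep_ratio
        keep |= torch.eye(n_blocks, dtype=torch.bool)  # always keep the diagonal blocks
        block_mask = keep.repeat_interleave(block_size, 0).repeat_interleave(block_size, 1)
        return causal & block_mask[:seq_len, :seq_len]
    raise ValueError(f"unknown pattern: {pattern}")

def hybrid_attention(q, k, v, head_patterns):
    """q, k, v: (batch, heads, seq, head_dim); one pattern string per head."""
    seq_len = q.shape[2]
    masks = torch.stack([make_head_mask(p, seq_len) for p in head_patterns]).to(q.device)
    # masks broadcasts as (1, heads, seq, seq) over the batch dimension
    return F.scaled_dot_product_attention(q, k, v, attn_mask=masks.unsqueeze(0))

q = k = v = torch.randn(1, 4, 512, 64)
out = hybrid_attention(q, k, v, ["dense", "streaming", "block_sparse", "streaming"])
print(out.shape)  # torch.Size([1, 4, 512, 64])
```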

Quick Start & Requirements

  • Install the build prerequisites with pip install packaging ninja, then build and install the package with python setup.py install.
  • Requires CUDA 11.6+, PyTorch 1.12+, and Linux.
  • Supports fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
  • Supports head dimensions of 32, 64, and 128 (a quick environment check is sketched after this list).
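The requirements above can be sanity-checked from Python before building. The snippet below is a hedged sketch using only standard PyTorch introspection; it is not a script shipped with the repository, and the compute-capability threshold simply encodes the Ampere/Ada/Hopper note above.

```python
# Rough environment check against the listed requirements (illustrative only).
import torch

def check_environment():
    assert torch.cuda.is_available(), "A CUDA-capable GPU (CUDA 11.6+) is required"
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])  # rough version parse
    assert (major, minor) >= (1, 12), f"PyTorch 1.12+ required, found {torch.__version__}"
    cc_major, _ = torch.cuda.get_device_capability()
    # bf16 needs Ampere (SM80), Ada (SM89), or Hopper (SM90); fp16 works more broadly
    print("bf16 eligible (compute capability >= 8.0):", cc_major >= 8)
    print("supported head dimensions: 32, 64, 128")

check_environment()
```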

Highlighted Details

  • Supports both forward and backward passes for block-sparse and block-streaming attention.
  • Enables hybrid masking by assigning different patterns to different heads (e.g., dense and streaming).
  • Demonstrates speedups over dense FlashAttention 2.4.2 on A100 GPUs.
  • Includes correctness and performance tests for various configurations (a minimal version of such a check is sketched after this list).
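The sketch below shows the general shape of such a correctness check in plain PyTorch: build a causal block-sparse mask, compare an attention output against an explicit softmax reference, and run a backward pass to exercise gradients. The mask layout, tolerances, and use of scaled_dot_product_attention as the stand-in fast path are illustrative assumptions, not the repository's actual test code.

```python
# Illustrative correctness check: masked attention vs. an explicit reference,
# plus a backward pass. Not the repo's test harness.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, D, block = 1, 2, 256, 64, 64
q = torch.randn(B, H, L, D, requires_grad=True)
k = torch.randn(B, H, L, D, requires_grad=True)
v = torch.randn(B, H, L, D, requires_grad=True)

# Causal block-sparse mask: keep diagonal blocks plus the first block column.
n_blocks = L // block
keep = torch.eye(n_blocks, dtype=torch.bool)
keep[:, 0] = True
mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
mask &= torch.tril(torch.ones(L, L, dtype=torch.bool))

# Reference path: explicit softmax with masked-out scores set to -inf.
scores = (q @ k.transpose(-2, -1)) / D**0.5
ref = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

# "Fast" path stand-in: SDPA with the same boolean mask.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

assert torch.allclose(out, ref, atol=1e-4)
out.sum().backward()  # exercises the backward pass
assert q.grad is not None and k.grad is not None and v.grad is not None
```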

Maintenance & Community

The project is associated with MIT and NVIDIA researchers, including Song Han. It acknowledges inspiration from FlashAttention, Big Bird, ETC, StreamingLLM, DuoAttention, and MInference 1.0.

Licensing & Compatibility

The repository does not explicitly state a license. The project is built upon FlashAttention, which is Apache 2.0 licensed. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The token_streaming_attn_func currently does not support the backward pass. The project is based on a specific version of FlashAttention (2.4.2), and compatibility with newer versions is not guaranteed.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
Parallel decoding algorithm for faster LLM inference
1k stars · Top 0.2% on SourcePulse · Created 1 year ago · Updated 6 months ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab
Fast, memory-efficient attention implementation
20k stars · Top 0.6% on SourcePulse · Created 3 years ago · Updated 1 day ago