Attention kernel for plug-and-play inference acceleration
Top 21.7% on sourcepulse
SageAttention provides highly optimized attention kernels for large language, image, and video models, achieving significant speedups (2-5x over FlashAttention, 3-11x over xformers) through quantization (INT8, FP8) and outlier smoothing without compromising end-to-end accuracy. It targets researchers and engineers seeking to accelerate inference on modern NVIDIA GPUs.
How It Works
SageAttention employs INT8 quantization for the $QK^\top$ operation and FP8 quantization for the $PV$ computation, utilizing a two-level accumulation strategy to maintain accuracy with lower precision. It offers optimized kernels for Ampere, Ada, and Hopper architectures, with specific optimizations for FP8 MMA and WGMMA. The approach prioritizes plug-and-play integration and supports torch.compile.
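As an illustration of the plug-and-play call, the sketch below invokes the sageattn kernel directly on half-precision tensors. The tensor_layout and is_causal arguments follow the usage documented in the repository README as best recalled here; exact signatures may vary between versions.

```python
import torch
from sageattention import sageattn

# Half-precision Q, K, V on the GPU; "HND" layout = (batch, heads, seq_len, head_dim).
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")

# INT8/FP8 quantization and smoothing happen inside the kernel;
# the call site looks like ordinary scaled dot-product attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # torch.Size([1, 32, 4096, 128])
```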
Quick Start & Requirements
Install the Triton-only version with pip install sageattention==1.0.6. For SageAttention 2.1.1, clone the repository and run python setup.py install or pip install -e . from the repository root.
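A short sanity check after installing (an assumed test script, not part of the package): it compares the kernel's output against PyTorch's reference attention on random inputs.

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

# Random half-precision inputs in (batch, heads, seq_len, head_dim) layout.
q, k, v = [torch.randn(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
           for _ in range(3)]

ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)

# The quantized kernel should stay numerically close to the fp16 reference.
print("max abs diff:", (ref - out).abs().max().item())
```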
Highlighted Details
Supports torch.compile and distributed inference.
Maintenance & Community
The project has been accepted to ICLR 2025 (Oral), and recent updates add support for the RTX 5090 and SpargeAttn. The primary contributors are listed in the papers.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that not all models are compatible with the F.scaled_dot_product_attention = sageattn replacement; direct modification of model attention classes is sometimes necessary. The latest versions require compilation from source.
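For models whose attention already routes through F.scaled_dot_product_attention, the global replacement mentioned above looks like the sketch below; models with custom attention implementations instead need sageattn called from inside their attention class's forward method.

```python
import torch.nn.functional as F
from sageattention import sageattn

# Monkey-patch: every subsequent call to F.scaled_dot_product_attention in
# the loaded model dispatches to the SageAttention kernel. This only helps
# models that actually call F.scaled_dot_product_attention; otherwise, edit
# the model's attention class to call sageattn(q, k, v, ...) directly.
F.scaled_dot_product_attention = sageattn
```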