SageAttention by thu-ml

Attention kernel for plug-and-play inference acceleration

created 10 months ago
2,113 stars

Top 21.7% on sourcepulse

Project Summary

SageAttention provides highly optimized attention kernels for large language, image, and video models, achieving significant speedups (2-5x over FlashAttention, 3-11x over xformers) through quantization (INT8, FP8) and outlier smoothing without compromising end-to-end accuracy. It targets researchers and engineers seeking to accelerate inference on modern NVIDIA GPUs.

How It Works

SageAttention employs INT8 quantization for the $QK^\top$ operation and FP8 quantization for the $PV$ computation, utilizing a two-level accumulation strategy to maintain accuracy with lower precision. It offers optimized kernels for Ampere, Ada, and Hopper architectures, with specific optimizations for FP8 MMA and WGMMA. The approach prioritizes plug-and-play integration and supports torch.compile.
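A minimal, slow PyTorch emulation of this numerical recipe may help make it concrete. This is a sketch of the idea only, not the project's fused Triton/CUDA kernels: per-tensor quantization scales stand in for the paper's per-block scales, and an FP32 upcast stands in for the two-level FP8 accumulation.

```python
import torch

def sage_attention_sketch(q, k, v, scale=None):
    # Illustrative emulation of SageAttention's numerics in plain PyTorch.
    # Expected shapes: (batch, heads, seq_len, head_dim).
    if scale is None:
        scale = q.shape[-1] ** -0.5

    # Outlier smoothing: subtract K's mean over the sequence dimension.
    # This shifts every score row by a per-row constant, which softmax
    # ignores, so the transform is exact while shrinking K's value range.
    k = k - k.mean(dim=-2, keepdim=True)

    # INT8 quantization of Q and K for the QK^T matmul (per-tensor
    # scales here for brevity; the paper quantizes per block).
    s_q = q.abs().amax().float() / 127.0
    s_k = k.abs().amax().float() / 127.0
    q_i8 = torch.round(q.float() / s_q).clamp(-127, 127)
    k_i8 = torch.round(k.float() / s_k).clamp(-127, 127)
    scores = (q_i8 @ k_i8.transpose(-2, -1)) * (s_q * s_k * scale)

    p = torch.softmax(scores, dim=-1)

    # FP8 (E4M3) quantization of P and V for the PV matmul. The real
    # kernel accumulates FP8 MMAs into higher precision in two levels;
    # upcasting to FP32 before the matmul approximates that here.
    p_f8 = p.to(torch.float8_e4m3fn).to(torch.float32)
    v_f8 = v.to(torch.float8_e4m3fn).to(torch.float32)
    return (p_f8 @ v_f8).to(q.dtype)
```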

Quick Start & Requirements

  • Installation: pip install sageattention==1.0.6 for the Triton-only version. For SageAttention 2.1.1, clone the repository and run python setup.py install or pip install -e . (a usage sketch follows this list).
  • Prerequisites: Python >= 3.9, PyTorch >= 2.3.0, Triton >= 3.0.0. CUDA >= 12.8 (Blackwell), >= 12.4 (FP8 on Ada), >= 12.3 (FP8 on Hopper), >= 12.0 (Ampere). FlashAttention3 must be compiled separately for benchmarking.
  • Resources: Compilation from source is required for the latest features and optimal performance.
  • Docs: SageAttention Paper, SageAttention2 Paper, Examples.
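The call below follows the example API shown in the project's README; treat the exact argument names (tensor_layout, is_causal) as assumptions if your installed version differs.

```python
import torch
from sageattention import sageattn

# Toy tensors in the (batch, heads, seq_len, head_dim) layout that the
# README labels "HND".
q = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")

# Drop-in replacement for scaled dot-product attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```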

Highlighted Details

  • Achieves 2-5x speedup over FlashAttention and 3-11x over xformers without accuracy loss.
  • Supports INT8 quantization for $QK^\top$ and FP8 quantization for $PV$.
  • Optimized kernels for Ampere, Ada, and Hopper GPUs, with support for Blackwell (RTX 5090) as well.
  • Compatible with torch.compile and distributed inference (see the sketch after this list).
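Since the kernels are exposed as ordinary PyTorch-callable functions, they can sit inside a compiled region. A hedged sketch (exact graph-break behavior depends on your PyTorch and Triton versions; the sageattn signature follows the README's example):

```python
import torch
from sageattention import sageattn

@torch.compile
def attn_block(q, k, v):
    # SageAttention inside a torch.compile region, per the
    # compatibility claim above.
    return sageattn(q, k, v, tensor_layout="HND", is_causal=True)
```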

Maintenance & Community

The project has been accepted to ICLR 2025 (Oral) and features recent updates including support for RTX5090 and SpargeAttn. The primary contributors are listed in the papers.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that not all models are compatible with the F.scaled_dot_product_attention = sageattn replacement; direct modification of model attention classes is sometimes necessary. The latest versions require compilation from source.
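The plug-and-play route the README describes is a global monkey-patch, roughly:

```python
import torch.nn.functional as F
from sageattention import sageattn

# Global replacement: any model that routes attention through
# F.scaled_dot_product_attention now calls SageAttention instead.
# This only works when the model calls SDPA with arguments sageattn
# accepts; models that bypass SDPA (custom attention classes, fused
# kernels, xformers paths) are unaffected, and their attention modules
# must be edited to call sageattn directly.
F.scaled_dot_product_attention = sageattn
```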

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 17
  • Star History: 698 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
High-performance 4-bit diffusion model inference engine. Top 2.1% on sourcepulse, 3k stars; created 8 months ago, updated 19 hours ago. Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

FasterTransformer by NVIDIA
Optimized transformer library for inference. Top 0.2% on sourcepulse, 6k stars; created 4 years ago, updated 1 year ago. Starred by Nat Friedman (former CEO of GitHub), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 6 more.