Attention kernel for plug-and-play inference acceleration
Top 21.7% on sourcepulse
SageAttention provides highly optimized attention kernels for large language, image, and video models, achieving significant speedups (2-5x over FlashAttention, 3-11x over xformers) through quantization (INT8, FP8) and outlier smoothing without compromising end-to-end accuracy. It targets researchers and engineers seeking to accelerate inference on modern NVIDIA GPUs.
How It Works
SageAttention employs INT8 quantization for the $QK^\top$ operation and FP8 quantization for the $PV$ computation, utilizing a two-level accumulation strategy to maintain accuracy with lower precision. It offers optimized kernels for Ampere, Ada, and Hopper architectures, with specific optimizations for FP8 MMA and WGMMA. The approach prioritizes plug-and-play integration and supports torch.compile.
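As an illustration of the plug-and-play call, the sketch below invokes the sageattn kernel directly on half-precision tensors. The tensor_layout and is_causal arguments follow the usage documented in the repository README as best recalled here; exact signatures may vary between versions.

```python
import torch
from sageattention import sageattn

# Half-precision Q, K, V on the GPU; "HND" layout = (batch, heads, seq_len, head_dim).
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")

# INT8/FP8 quantization and smoothing happen inside the kernel;
# the call site looks like ordinary scaled dot-product attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # torch.Size([1, 32, 4096, 128])
```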
Quick Start & Requirements
Install the Triton-only version with pip install sageattention==1.0.6. For SageAttention 2.1.1, clone the repository and run python setup.py install or pip install -e . from the repository root.
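A short sanity check after installing (an assumed test script, not part of the package): it compares the kernel's output against PyTorch's reference attention on random inputs.

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

# Random half-precision inputs in (batch, heads, seq_len, head_dim) layout.
q, k, v = [torch.randn(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
           for _ in range(3)]

ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)

# The quantized kernel should stay numerically close to the fp16 reference.
print("max abs diff:", (ref - out).abs().max().item())
```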
Highlighted Details
Supports torch.compile and distributed inference.
Maintenance & Community
The project has been accepted to ICLR 2025 (Oral), and recent updates add support for the RTX 5090 and SpargeAttn. The primary contributors are listed in the papers.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README notes that not all models are compatible with the F.scaled_dot_product_attention = sageattn replacement; direct modification of model attention classes is sometimes necessary. The latest versions require compilation from source.
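For models whose attention already routes through F.scaled_dot_product_attention, the global replacement mentioned above looks like the sketch below; models with custom attention implementations instead need sageattn called from inside their attention class's forward method.

```python
import torch.nn.functional as F
from sageattention import sageattn

# Monkey-patch: every subsequent call to F.scaled_dot_product_attention in
# the loaded model dispatches to the SageAttention kernel. This only helps
# models that actually call F.scaled_dot_product_attention; otherwise, edit
# the model's attention class to call sageattn(q, k, v, ...) directly.
F.scaled_dot_product_attention = sageattn
```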