SageAttention by thu-ml

Attention kernel for plug-and-play inference acceleration

Created 1 year ago
3,027 stars

Top 15.7% on SourcePulse

View on GitHub
Project Summary

SageAttention provides highly optimized attention kernels for large language, image, and video models, achieving significant speedups (2-5x over FlashAttention, 3-11x over xformers) through quantization (INT8, FP8) and outlier smoothing without compromising end-to-end accuracy. It targets researchers and engineers seeking to accelerate inference on modern NVIDIA GPUs.

How It Works

SageAttention employs INT8 quantization for the $QK^\top$ operation and FP8 quantization for the $PV$ computation, utilizing a two-level accumulation strategy to maintain accuracy with lower precision. It offers optimized kernels for Ampere, Ada, and Hopper architectures, with specific optimizations for FP8 MMA and WGMMA. The approach prioritizes plug-and-play integration and supports torch.compile.
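A rough numerical sketch of the idea in plain PyTorch (this simulates the low-precision arithmetic rather than calling the actual CUDA/Triton kernels; the block size and the mean-smoothing of K are illustrative choices based on the papers' description):

```python
import torch

def quantize_int8(x, block=64):
    # Per-token-block symmetric INT8 quantization (simplified).
    n, d = x.shape
    xb = x.view(n // block, block, d)
    scale = xb.abs().amax(dim=(1, 2), keepdim=True) / 127.0
    q = torch.clamp((xb / scale).round(), -127, 127)
    return q.view(n, d), scale.view(-1, 1)

block, n, d = 64, 128, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))

# Smooth K by removing its mean over tokens: softmax is invariant to the
# per-row constant this subtracts from QK^T, so the output is unchanged
# while the INT8 quantization error on K shrinks.
K = K - K.mean(dim=0, keepdim=True)

Qq, sq = quantize_int8(Q, block)
Kq, sk = quantize_int8(K, block)

# QK^T in (simulated) INT8, then dequantize with the per-block scales.
S = Qq @ Kq.T
S = S * sq.repeat_interleave(block, dim=0) * sk.repeat_interleave(block, dim=0).T
S = S / d**0.5
P = torch.softmax(S, dim=-1)

# Simulate FP8 for the PV product with a float8_e4m3fn round trip
# (the real kernels use FP8 MMA/WGMMA with two-level accumulation).
out = P.to(torch.float8_e4m3fn).float() @ V

ref = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V
print((out - ref).abs().max())  # error stays small despite low precision
```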

Quick Start & Requirements

  • Installation: pip install sageattention==1.0.6 for the Triton-only version. For SageAttention 2.1.1, clone the repository and run python setup.py install or pip install -e . from the repository root (a usage sketch follows this list).
  • Prerequisites: Python >= 3.9, PyTorch >= 2.3.0, Triton >= 3.0.0. CUDA >= 12.8 (Blackwell), >= 12.4 (FP8 on Ada), >= 12.3 (FP8 on Hopper), >= 12.0 (Ampere). FlashAttention3 must be compiled separately for benchmarking.
  • Resources: Compilation from source is required for the latest features and optimal performance.
  • Docs: SageAttention Paper, SageAttention2 Paper, Examples.
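
A minimal usage sketch of the kernel call, following the drop-in API described in the project README (the tensor shapes and the tensor_layout/is_causal arguments are shown as assumptions; check the README for the exact signature of your installed version):

```python
import torch
from sageattention import sageattn

# (batch, heads, seq_len, head_dim) tensors on a supported NVIDIA GPU.
q = torch.randn(1, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 4096, 128, dtype=torch.float16, device="cuda")

# Drop-in replacement for scaled dot-product attention.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```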

Highlighted Details

  • Achieves 2-5x speedup over FlashAttention and 3-11x over xformers without accuracy loss.
  • Supports INT8 quantization for $QK^\top$ and FP8 quantization for $PV$.
  • Optimized kernels for Ampere, Ada, and Hopper GPUs, with specific support for Blackwell.
  • Compatible with torch.compile and distributed inference.

Maintenance & Community

The project has been accepted to ICLR 2025 (Oral), and recent updates include support for the RTX 5090 and SpargeAttn. The primary contributors are listed in the papers.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that not all models are compatible with the F.scaled_dot_product_attention = sageattn replacement; direct modification of model attention classes is sometimes necessary. The latest versions require compilation from source.
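
A hedged sketch of the two integration routes this refers to (the class in the second route is hypothetical and only illustrates the manual-patching approach):

```python
import torch.nn.functional as F
from sageattention import sageattn

# Route 1: global replacement; works only when the model calls
# F.scaled_dot_product_attention with plain q, k, v tensors.
F.scaled_dot_product_attention = sageattn

# Route 2: when the global patch is incompatible, call sageattn directly
# inside the model's attention class instead of the original attention op.
# Hypothetical example:
# class PatchedAttention(OriginalAttention):
#     def forward(self, q, k, v):
#         return sageattn(q, k, v, is_causal=self.is_causal)
```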

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 4
Issues (30d): 10

Star History

215 stars in the last 30 days

Explore Similar Projects

Starred by Chris Lattner (Author of LLVM, Clang, Swift, Mojo, MLIR; Cofounder of Modular), Vincent Weisser (Cofounder of Prime Intellect), and 18 more.

open-infra-index by deepseek-ai

0.0%
8k
AI infrastructure tools for efficient AGI development
Created 10 months ago
Updated 8 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai

3.5%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 10 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 22 hours ago