SpargeAttn by thu-ml

Training-free sparse attention for model inference acceleration

created 5 months ago
664 stars

Top 51.6% on sourcepulse

Project Summary

SpargeAttn provides a training-free sparse attention mechanism designed to accelerate inference across various models, including language, image, and video generation. It targets researchers and engineers seeking to improve the efficiency of existing deep learning architectures without requiring model retraining.

How It Works

SpargeAttn implements a sparse attention mechanism that dynamically predicts which parts of the attention map are salient and computes scores only for those regions, skipping the rest. Because most attention maps are highly sparse at inference time, this selective computation yields significant speedups. The implementation offers two variants: one based on SageAttention and an updated version based on SageAttention2, which is reported to be roughly 30% faster.
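To make the idea concrete, here is a toy NumPy sketch of block-sparse attention with similarity-based block selection. It is illustrative only: block importance is estimated from mean-pooled Q/K summaries and low-scoring K/V blocks are skipped, whereas the actual project implements this pattern in fused CUDA kernels with different selection criteria.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=16, keep=0.5):
    """Toy block-sparse attention: estimate block importance from
    mean-pooled Q/K similarity and skip low-scoring K/V blocks.
    Illustrative sketch only, not SpargeAttn's actual kernels."""
    n, d = q.shape
    nb = n // block
    # Cheap block summaries: mean-pool each block of Q and K.
    qm = q.reshape(nb, block, d).mean(axis=1)        # (nb, d)
    km = k.reshape(nb, block, d).mean(axis=1)        # (nb, d)
    score = qm @ km.T / np.sqrt(d)                   # block-level affinity
    # For each Q block, keep only the top fraction of K blocks.
    k_keep = max(1, int(np.ceil(keep * nb)))
    out = np.zeros_like(q)
    for i in range(nb):
        cols = np.argsort(score[i])[-k_keep:]        # selected K/V blocks
        idx = np.concatenate(
            [np.arange(c * block, (c + 1) * block) for c in cols])
        qi = q[i * block:(i + 1) * block]
        s = qi @ k[idx].T / np.sqrt(d)
        # Numerically stable softmax over the selected columns only.
        s = np.exp(s - s.max(axis=1, keepdims=True))
        p = s / s.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = p @ v[idx]
    return out
```

With `keep=1.0` every block is retained and the result matches dense attention; lowering `keep` trades accuracy for fewer score computations, which is the source of the speedup.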

Quick Start & Requirements

  • Install with pip install ninja, then python setup.py install (or pip install -e . for an editable install).
  • Requires Python >= 3.9, PyTorch >= 2.3.0, and CUDA >= 12.0 (specific versions for FP8 support on Ada/Hopper/Blackwell).
  • Official documentation and examples for CogVideoX and Llama are available.
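Put together, the setup steps above amount to the following. The clone URL is inferred from the org and project name and should be checked against the repository page:

```shell
# Assumes Python >= 3.9, PyTorch >= 2.3.0, and CUDA >= 12.0, per the requirements above.
git clone https://github.com/thu-ml/SpargeAttn.git   # URL inferred from org/name
cd SpargeAttn
pip install ninja          # build-time dependency for compiling the CUDA extensions
python setup.py install    # or: pip install -e .   (editable install)
```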

Highlighted Details

  • Accelerates inference for language, image, and video models.
  • Training-free, allowing direct application to pre-trained models.
  • Offers variants based on SageAttention and SageAttention2 for improved performance.
  • Supports CogVideoX-2b and Wan2.1 (T2V 1.3B), with tuned checkpoints available.

Maintenance & Community

The project welcomes contributions for supporting additional models. Links to Hugging Face for tuned checkpoints are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The README notes that provided hyper-parameters are tuned for the SageAttention variant, and re-tuning is recommended for optimal performance with the newer SageAttention2 API. The --compile flag can slow down the first inference pass.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 5

Star History

166 stars in the last 90 days

