SpargeAttn by thu-ml

Training-free sparse attention for model inference acceleration

created 5 months ago
664 stars

Top 51.6% on sourcepulse

Project Summary

SpargeAttn provides a training-free sparse attention mechanism designed to accelerate inference across various models, including language, image, and video generation. It targets researchers and engineers seeking to improve the efficiency of existing deep learning architectures without requiring model retraining.

How It Works

SpargeAttn implements a sparse attention mechanism that dynamically predicts which parts of the attention map are salient and computes scores only for those regions, skipping the rest. Because most attention maps are highly sparse at inference time, this selective computation yields significant speedups. The implementation offers two variants: one based on SageAttention and an updated version based on SageAttention2, which is reported to be roughly 30% faster.
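To make the idea concrete, here is a toy NumPy sketch of block-sparse attention with similarity-based block selection. It is illustrative only: block importance is estimated from mean-pooled Q/K summaries and low-scoring K/V blocks are skipped, whereas the actual project implements this pattern in fused CUDA kernels with different selection criteria.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=16, keep=0.5):
    """Toy block-sparse attention: estimate block importance from
    mean-pooled Q/K similarity and skip low-scoring K/V blocks.
    Illustrative sketch only, not SpargeAttn's actual kernels."""
    n, d = q.shape
    nb = n // block
    # Cheap block summaries: mean-pool each block of Q and K.
    qm = q.reshape(nb, block, d).mean(axis=1)        # (nb, d)
    km = k.reshape(nb, block, d).mean(axis=1)        # (nb, d)
    score = qm @ km.T / np.sqrt(d)                   # block-level affinity
    # For each Q block, keep only the top fraction of K blocks.
    k_keep = max(1, int(np.ceil(keep * nb)))
    out = np.zeros_like(q)
    for i in range(nb):
        cols = np.argsort(score[i])[-k_keep:]        # selected K/V blocks
        idx = np.concatenate(
            [np.arange(c * block, (c + 1) * block) for c in cols])
        qi = q[i * block:(i + 1) * block]
        s = qi @ k[idx].T / np.sqrt(d)
        # Numerically stable softmax over the selected columns only.
        s = np.exp(s - s.max(axis=1, keepdims=True))
        p = s / s.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = p @ v[idx]
    return out
```

With `keep=1.0` every block is retained and the result matches dense attention; lowering `keep` trades accuracy for fewer score computations, which is the source of the speedup.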

Quick Start & Requirements

  • Install with pip install ninja, then python setup.py install (or pip install -e . for an editable install).
  • Requires Python >= 3.9, PyTorch >= 2.3.0, and CUDA >= 12.0 (specific versions for FP8 support on Ada/Hopper/Blackwell).
  • Official documentation and examples for CogVideoX and Llama are available.
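Put together, the setup steps above amount to the following. The clone URL is inferred from the org and project name and should be checked against the repository page:

```shell
# Assumes Python >= 3.9, PyTorch >= 2.3.0, and CUDA >= 12.0, per the requirements above.
git clone https://github.com/thu-ml/SpargeAttn.git   # URL inferred from org/name
cd SpargeAttn
pip install ninja          # build-time dependency for compiling the CUDA extensions
python setup.py install    # or: pip install -e .   (editable install)
```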

Highlighted Details

  • Accelerates inference for language, image, and video models.
  • Training-free, allowing direct application to pre-trained models.
  • Offers variants based on SageAttention and SageAttention2 for improved performance.
  • Supports CogVideoX-2b and Wan2.1 (T2V 1.3B), with tuned checkpoints available.

Maintenance & Community

The project welcomes contributions for supporting additional models. Links to Hugging Face for tuned checkpoints are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The README notes that provided hyper-parameters are tuned for the SageAttention variant, and re-tuning is recommended for optimal performance with the newer SageAttention2 API. The --compile flag can slow down the first inference pass.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 5

Star History

166 stars in the last 90 days

