Fast, memory-efficient attention implementation
FlashAttention provides highly optimized implementations of the attention mechanism for deep learning models, addressing the quadratic memory and computational complexity of standard attention. It is designed for researchers and engineers building large-scale transformer models, offering significant speedups and memory savings.
How It Works
FlashAttention leverages IO-aware kernel fusion to reduce memory read/write operations between GPU high-bandwidth memory (HBM) and on-chip SRAM. By tiling attention computations and performing softmax within SRAM, it avoids materializing the large attention matrix, leading to substantial performance gains and reduced memory footprint.
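As an illustration only, the following pure-PyTorch sketch contrasts standard attention, which materializes the full seqlen x seqlen score matrix, with the tiling and online-softmax rescaling idea that FlashAttention fuses into a single kernel. This is not the library's implementation, just the underlying math.

import torch

def naive_attention(q, k, v):
    # Standard attention: materializes the full (seqlen x seqlen) score
    # matrix in GPU HBM, hence O(N^2) memory traffic.
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    return torch.softmax(scores, dim=-1) @ v

def tiled_attention(q, k, v, block=128):
    # Illustrative online-softmax tiling (the idea behind FlashAttention).
    # The real implementation is a fused CUDA kernel that keeps each tile
    # in on-chip SRAM; this loop only shows the arithmetic.
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros(q.shape[:-1] + (1,), device=q.device, dtype=q.dtype)
    for start in range(0, k.shape[-2], block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = (q @ kb.transpose(-2, -1)) * scale              # scores for this tile only
        new_max = torch.maximum(row_max, s.amax(-1, keepdim=True))
        rescale = torch.exp(row_max - new_max)              # correct earlier partial results
        p = torch.exp(s - new_max)
        row_sum = row_sum * rescale + p.sum(-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum

# The two agree numerically, but the tiled version never builds the full score matrix.
q = k = v = torch.randn(1, 4, 512, 64)
assert torch.allclose(naive_attention(q, k, v), tiled_attention(q, k, v), atol=1e-5)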
Quick Start & Requirements
pip install flash-attn --no-build-isolation
The packaging and ninja Python packages are required. With ninja, compilation takes 3-5 minutes on a 64-core machine; without ninja, it can take up to 2 hours. To build the FlashAttention-3 kernels for Hopper GPUs, run cd hopper && python setup.py install.
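After installation, the main entry point is flash_attn_func, a fused replacement for standard scaled-dot-product attention. A minimal usage sketch, following the documented requirements of (batch, seqlen, nheads, headdim) tensors in fp16 or bf16 on a CUDA device; see the repository README for the full argument list:

import torch
from flash_attn import flash_attn_func

# Q, K, V: (batch, seqlen, nheads, headdim), fp16/bf16, on GPU
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

# Fused attention; causal=True applies an autoregressive mask
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # torch.Size([2, 1024, 16, 64])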
Highlighted Details
flash_attn_with_kvcache for efficient inference with KV caching and paged KV cache support.
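A hedged sketch of a single decoding step with flash_attn_with_kvcache, assuming the argument names from recent releases (k_cache/v_cache are preallocated, and the new token's K/V are written in place at the per-sequence offsets given by cache_seqlens):

import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 16, 64, 4096

# Preallocated KV cache: (batch, max_seqlen, nheads, headdim)
k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
v_cache = torch.zeros_like(k_cache)

# Number of tokens already in the cache for each sequence (int32 per batch element)
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device="cuda")

# Q/K/V for the single new token being decoded
q = torch.randn(batch, 1, nheads, headdim, dtype=torch.float16, device="cuda")
k_new = torch.randn_like(q)
v_new = torch.randn_like(q)

# Writes k_new/v_new into the caches at cache_seqlens, then attends over the cache
out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k_new, v=v_new,
                              cache_seqlens=cache_seqlens, causal=True)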
Maintenance & Community
The project is actively maintained by Tri Dao and the Dao-AILab team. Discussions and support are available via GitHub Issues.
Licensing & Compatibility
Released under the BSD 3-Clause license. FlashAttention-2 requires PyTorch and CUDA and targets Ampere, Ada, and Hopper GPUs; AMD GPUs are supported via ROCm backends.
Limitations & Caveats
Windows compilation is experimental and may require further testing. FlashAttention-2 does not yet support Turing GPUs; use FlashAttention 1.x on those for now. The Triton backend for AMD GPUs is still under active development, with some features pending.