flash-attention by Dao-AILab

Fast, memory-efficient attention implementation

created 3 years ago
18,633 stars

Top 2.5% on sourcepulse

Project Summary

FlashAttention provides highly optimized implementations of the attention mechanism for deep learning models, addressing the quadratic memory and computational complexity of standard attention. It is designed for researchers and engineers building large-scale transformer models, offering significant speedups and memory savings.

How It Works

FlashAttention leverages IO-aware kernel fusion to reduce memory read/write operations between GPU high-bandwidth memory (HBM) and on-chip SRAM. By tiling attention computations and performing softmax within SRAM, it avoids materializing the large attention matrix, leading to substantial performance gains and reduced memory footprint.
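
As a rough illustration, the sketch below re-creates the tiling and online-softmax bookkeeping in plain PyTorch for a single head with no masking. It is a toy rendering of the idea, not the library's fused CUDA/ROCm kernels, and the block size is arbitrary.

```python
# Toy illustration of the tiling + online-softmax idea behind FlashAttention.
# The real library fuses this loop into CUDA/ROCm kernels; shapes and the
# block size here are illustrative only (single head, no masking).
import torch

def tiled_attention(q, k, v, block_size=128):
    # q, k, v: (seqlen, head_dim) for one attention head.
    seqlen, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seqlen, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((seqlen, 1), dtype=q.dtype, device=q.device)

    for start in range(0, seqlen, block_size):
        k_blk = k[start:start + block_size]           # one K tile kept "on chip"
        v_blk = v[start:start + block_size]           # matching V tile
        scores = (q @ k_blk.T) * scale                # only a (seqlen, block) tile of QK^T
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)     # running max for a stable softmax
        correction = torch.exp(row_max - new_max)     # rescale previously accumulated results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum                              # final softmax normalization
```

The point of the loop is that only one tile of scores exists at a time; the full seqlen x seqlen attention matrix is never materialized.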

Quick Start & Requirements

  • Installation: pip install flash-attn --no-build-isolation (see the usage sketch after this list)
  • Prerequisites:
    • CUDA toolkit (12.0+) or ROCm toolkit (6.0+)
    • PyTorch 2.2+
    • packaging and ninja Python packages
    • Linux (Windows support is experimental)
    • NVIDIA Ampere, Ada, or Hopper GPUs for FP16/BF16; Turing GPUs supported by FlashAttention 1.x.
    • AMD MI200 or MI300 GPUs for ROCm support.
  • Compilation: With ninja, compilation takes 3-5 minutes on a 64-core machine. Without ninja, it can take up to 2 hours.
  • FlashAttention-3 (Beta): Requires H100/H800 GPU and CUDA >= 12.3 (12.8 recommended). Installation: cd hopper && python setup.py install.
  • Docs: https://github.com/Dao-AILab/flash-attention
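
A minimal usage sketch after installation, assuming the flash_attn_func interface and the (batch, seqlen, nheads, headdim) fp16/bf16 tensor layout described in the repository README; verify the exact signature against your installed version.

```python
# Quick smoke test on a supported GPU; layout is (batch, seqlen, nheads, headdim).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal (decoder-style) attention; omit causal=True for bidirectional attention.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```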

Highlighted Details

  • Achieves up to 2x speedup and 10-20x memory savings compared to standard PyTorch attention.
  • Supports causal attention, sliding window attention, multi-query/grouped-query attention (MQA/GQA), rotary embeddings, and ALiBi.
  • Includes flash_attn_with_kvcache for efficient inference with KV caching and paged KV cache support (see the sketch after this list).
  • Offers ROCm support with both Composable Kernel (CK) and Triton backends.
  • FlashAttention-3 beta adds FP8 support for Hopper GPUs.
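
As a hedged sketch of the inference path, the snippet below runs one decoding step with flash_attn_with_kvcache; the argument names (k_cache, v_cache, k, v, cache_seqlens) follow the repository's docstring and may differ between releases.

```python
# One single-token decoding step against a pre-allocated KV cache.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 8, 64, 4096
dtype, device = torch.float16, "cuda"

# Pre-allocated cache; cache_seqlens tracks how many positions are already filled.
k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, dtype=dtype, device=device)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device=device)

# New token: passing k/v appends them into the cache in place at cache_seqlens.
q = torch.randn(batch, 1, nheads, headdim, dtype=dtype, device=device)
k_new = torch.randn_like(q)
v_new = torch.randn_like(q)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
print(out.shape)  # (batch, 1, nheads, headdim)
```

Because the cache tensors are updated in place, they can be reused across decoding steps without reallocating memory.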

Maintenance & Community

The project is actively maintained by Tri Dao and the Dao-AILab team. Discussions and support are available via GitHub Issues.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The permissive license allows commercial use and integration into closed-source projects.

Limitations & Caveats

Windows compilation is experimental and may require additional setup. FlashAttention-2 does not yet support Turing GPUs; use FlashAttention 1.x for those. The Triton backend for AMD is still under active development, with some features pending.

Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 17
  • Issues (30d): 41
  • Star History: 1,556 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM: Lightweight training framework for model pre-training. Top 1.0% on sourcepulse, 402 stars; created 1 year ago, updated 1 week ago.