flash-attention by Dao-AILab

Fast, memory-efficient attention implementation

created 3 years ago
18,633 stars

Top 2.5% on sourcepulse

Project Summary

FlashAttention provides highly optimized implementations of the attention mechanism for deep learning models, addressing the quadratic memory and computational complexity of standard attention. It is designed for researchers and engineers building large-scale transformer models, offering significant speedups and memory savings.

How It Works

FlashAttention leverages IO-aware kernel fusion to reduce memory read/write operations between GPU high-bandwidth memory (HBM) and on-chip SRAM. By tiling attention computations and performing softmax within SRAM, it avoids materializing the large attention matrix, leading to substantial performance gains and reduced memory footprint.
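
As a rough illustration, the sketch below re-creates the tiling and online-softmax bookkeeping in plain PyTorch for a single head with no masking. It is a toy rendering of the idea, not the library's fused CUDA/ROCm kernels, and the block size is arbitrary.

```python
# Toy illustration of the tiling + online-softmax idea behind FlashAttention.
# The real library fuses this loop into CUDA/ROCm kernels; shapes and the
# block size here are illustrative only (single head, no masking).
import torch

def tiled_attention(q, k, v, block_size=128):
    # q, k, v: (seqlen, head_dim) for one attention head.
    seqlen, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seqlen, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((seqlen, 1), dtype=q.dtype, device=q.device)

    for start in range(0, seqlen, block_size):
        k_blk = k[start:start + block_size]           # one K tile kept "on chip"
        v_blk = v[start:start + block_size]           # matching V tile
        scores = (q @ k_blk.T) * scale                # only a (seqlen, block) tile of QK^T
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)     # running max for a stable softmax
        correction = torch.exp(row_max - new_max)     # rescale previously accumulated results
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum                              # final softmax normalization
```

The point of the loop is that only one tile of scores exists at a time; the full seqlen x seqlen attention matrix is never materialized.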

Quick Start & Requirements

  • Installation: pip install flash-attn --no-build-isolation (see the usage sketch after this list)
  • Prerequisites:
    • CUDA toolkit (12.0+) or ROCm toolkit (6.0+)
    • PyTorch 2.2+
    • packaging and ninja Python packages
    • Linux (Windows support is experimental)
    • NVIDIA Ampere, Ada, or Hopper GPUs for FP16/BF16; Turing GPUs supported by FlashAttention 1.x.
    • AMD MI200 or MI300 GPUs for ROCm support.
  • Compilation: With ninja, compilation takes 3-5 minutes on a 64-core machine. Without ninja, it can take up to 2 hours.
  • FlashAttention-3 (Beta): Requires H100/H800 GPU and CUDA >= 12.3 (12.8 recommended). Installation: cd hopper && python setup.py install.
  • Docs: https://github.com/Dao-AILab/flash-attention
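
A minimal usage sketch after installation, assuming the flash_attn_func interface and the (batch, seqlen, nheads, headdim) fp16/bf16 tensor layout described in the repository README; verify the exact signature against your installed version.

```python
# Quick smoke test on a supported GPU; layout is (batch, seqlen, nheads, headdim).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal (decoder-style) attention; omit causal=True for bidirectional attention.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```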

Highlighted Details

  • Achieves up to 2x speedup and 10-20x memory savings compared to standard PyTorch attention.
  • Supports causal attention, sliding window attention, multi-query/grouped-query attention (MQA/GQA), rotary embeddings, and ALiBi.
  • Includes flash_attn_with_kvcache for efficient inference with KV caching and paged KV cache support (see the sketch after this list).
  • Offers ROCm support with both Composable Kernel (CK) and Triton backends.
  • FlashAttention-3 beta adds FP8 support for Hopper GPUs.
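
As a hedged sketch of the inference path, the snippet below runs one decoding step with flash_attn_with_kvcache; the argument names (k_cache, v_cache, k, v, cache_seqlens) follow the repository's docstring and may differ between releases.

```python
# One single-token decoding step against a pre-allocated KV cache.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 8, 64, 4096
dtype, device = torch.float16, "cuda"

# Pre-allocated cache; cache_seqlens tracks how many positions are already filled.
k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, dtype=dtype, device=device)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device=device)

# New token: passing k/v appends them into the cache in place at cache_seqlens.
q = torch.randn(batch, 1, nheads, headdim, dtype=dtype, device=device)
k_new = torch.randn_like(q)
v_new = torch.randn_like(q)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
print(out.shape)  # (batch, 1, nheads, headdim)
```

Because the cache tensors are updated in place, they can be reused across decoding steps without reallocating memory.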

Maintenance & Community

The project is actively maintained by Tri Dao and the Dao-AILab team. Discussions and support are available via GitHub Issues.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: The permissive license allows commercial use and integration into closed-source projects.

Limitations & Caveats

Windows compilation is experimental and may require additional setup. FlashAttention-2 does not yet support Turing GPUs; use FlashAttention 1.x for those. The Triton backend for AMD is still under active development, with some features pending.

Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 17
  • Issues (30d): 41
  • Star History: 1,556 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM: Lightweight training framework for model pre-training. Top 1.0% on sourcepulse, 402 stars; created 1 year ago, updated 1 week ago.