ffpa-attn by xlite-dev

Optimized attention for LLMs

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This project addresses performance bottlenecks in attention mechanisms, particularly for large headdim values, by introducing FFPA (Faster Flash Prefill Attention). It offers a novel O(1) SRAM complexity solution, targeting researchers and engineers working with large transformer models who need to optimize inference and training speed. The primary benefit is a significant speedup of 1.8x to 3x over standard implementations such as SDPA EA (PyTorch's scaled_dot_product_attention with the efficient-attention backend), especially for demanding headdim configurations.

How It Works

FFPA extends FlashAttention-2 with a fine-grained tiling strategy at the MMA (Matrix Multiply-Accumulate) level, optimizing Q@K^T and P@V operations for large headdim (D > 256). This approach achieves O(1) SRAM complexity and O(d/4) or O(1) register complexity, enabling efficient processing beyond the typical limits of standard FlashAttention. FFPA is structured into three levels (L1-L3), each offering different trade-offs in register usage and recomputation while maintaining the same VRAM footprint. For smaller headdim, FFPA utilizes a coarse-grained tiling approach, delivering competitive performance.

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/xlite-dev/ffpa-attn.git) and run bash .dev/install.sh, or build and install the wheel package (python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl).
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.0 (recommended 2.5.1), CUDA >= 12.4 (recommended 12.5), and flash-attention >= 2.6.3 (for testing).
  • Docker: nvcr.io/nvidia/pytorch:25.03-py3 is available.
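A minimal sketch for sanity-checking the version prerequisites above before installing; the helper names are illustrative and not part of ffpa-attn.

```python
def _parse(version: str) -> tuple:
    """Parse a dotted version string like '2.5.1' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def meets_prereqs(python_v: str, torch_v: str, cuda_v: str) -> bool:
    """Check the documented minimums: Python >= 3.10, PyTorch >= 2.4.0,
    CUDA >= 12.4. Purely illustrative string comparison; it does not
    cover the optional flash-attention >= 2.6.3 test dependency."""
    return (_parse(python_v) >= (3, 10)
            and _parse(torch_v) >= (2, 4, 0)
            and _parse(cuda_v) >= (12, 4))
```

For example, `meets_prereqs("3.10.12", "2.5.1", "12.5")` matches the recommended stack, while an older PyTorch such as 2.3.x fails the check.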

Highlighted Details

  • Achieves 1.8x-3x speedups over SDPA EA for headdim > 256 on various NVIDIA GPUs (L20, A30, 3080, 4090).
  • Supports mixed-precision MMA accumulation, including QK F32 + PV F16, for enhanced performance.
  • Leverages pure MMA PTX instructions with advanced features like Split-Q, multi-stage processing, and collective stores.
  • Specifically targets headdim > 256, a regime that standard FlashAttention-2 does not support.

Maintenance & Community

Information regarding specific maintainers, community channels (e.g., Discord, Slack), or active development beyond the core contributors is not detailed in the provided README. Contributions are welcomed via pull requests.

Licensing & Compatibility

The project is licensed under the GNU General Public License v3.0 (GPL-3.0). This strong copyleft license requires derivative works to also be licensed under GPL-3.0, which may impose restrictions on its use in closed-source commercial products or when linking with libraries under incompatible licenses.

Limitations & Caveats

FFPA's primary advantage applies to headdim values above 256; for smaller dimensions it is competitive but offers no unique benefit over existing kernels. The GPL-3.0 license is a significant consideration for adoption in proprietary software, and the project requires recent versions of CUDA and PyTorch.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 4
Issues (30d): 0
Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0%
1k
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 11 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 17 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.3%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 17 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 41 more.

unsloth by unslothai

0.7%
53k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 2 years ago
Updated 22 hours ago