ffpa-attn by xlite-dev

Optimized attention for LLMs

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This project addresses performance bottlenecks in attention mechanisms, particularly for large headdim values, by introducing FFPA (Faster Flash Prefill Attention). It offers a novel O(1) SRAM complexity solution, targeting researchers and engineers working with large transformer models who need to optimize inference and training speed. The primary benefit is a significant speedup of 1.8x to 3x over standard implementations such as SDPA EA (PyTorch's scaled_dot_product_attention with the efficient-attention backend), especially for demanding headdim configurations.

How It Works

FFPA extends FlashAttention-2 with a fine-grained tiling strategy at the MMA (Matrix Multiply-Accumulate) level, optimizing Q@K^T and P@V operations for large headdim (D > 256). This approach achieves O(1) SRAM complexity and O(d/4) or O(1) register complexity, enabling efficient processing beyond the typical limits of standard FlashAttention. FFPA is structured into three levels (L1-L3), each offering different trade-offs in register usage and recomputation while maintaining the same VRAM footprint. For smaller headdim, FFPA utilizes a coarse-grained tiling approach, delivering competitive performance.

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/xlite-dev/ffpa-attn.git) and run bash .dev/install.sh, or build and install the wheel package (python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl).
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.0 (recommended 2.5.1), CUDA >= 12.4 (recommended 12.5), and flash-attention >= 2.6.3 (for testing).
  • Docker: nvcr.io/nvidia/pytorch:25.03-py3 is available.
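A minimal sketch for sanity-checking the version prerequisites above before installing; the helper names are illustrative and not part of ffpa-attn.

```python
def _parse(version: str) -> tuple:
    """Parse a dotted version string like '2.5.1' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def meets_prereqs(python_v: str, torch_v: str, cuda_v: str) -> bool:
    """Check the documented minimums: Python >= 3.10, PyTorch >= 2.4.0,
    CUDA >= 12.4. Purely illustrative string comparison; it does not
    cover the optional flash-attention >= 2.6.3 test dependency."""
    return (_parse(python_v) >= (3, 10)
            and _parse(torch_v) >= (2, 4, 0)
            and _parse(cuda_v) >= (12, 4))
```

For example, `meets_prereqs("3.10.12", "2.5.1", "12.5")` matches the recommended stack, while an older PyTorch such as 2.3.x fails the check.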

Highlighted Details

  • Achieves 1.8x-3x speedups over SDPA EA for headdim > 256 on various NVIDIA GPUs (L20, A30, 3080, 4090).
  • Supports mixed-precision MMA accumulation, including QK F32 + PV F16, for enhanced performance.
  • Leverages pure MMA PTX instructions with advanced features like Split-Q, multi-stage processing, and collective stores.
  • Specifically targets headdim > 256, a regime that standard FlashAttention-2 does not support.

Maintenance & Community

Information regarding specific maintainers, community channels (e.g., Discord, Slack), or active development beyond the core contributors is not detailed in the provided README. Contributions are welcomed via pull requests.

Licensing & Compatibility

The project is licensed under the GNU General Public License v3.0 (GPL-3.0). This strong copyleft license requires derivative works to also be licensed under GPL-3.0, which may impose restrictions on its use in closed-source commercial products or when linking with libraries under incompatible licenses.

Limitations & Caveats

FFPA's primary advantage applies to headdim values above 256; for smaller dimensions it is competitive but offers no unique benefit over existing kernels. The GPL-3.0 license is a significant consideration for adoption in proprietary software, and the project requires recent versions of CUDA and PyTorch.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 4
Issues (30d): 0
Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0%
1k
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 11 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 17 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.3%
22k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 17 hours ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 41 more.

unsloth by unslothai

0.7%
53k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 2 years ago
Updated 22 hours ago