xlite-dev/ffpa-attn: Optimized attention for LLMs
Top 99.8% on SourcePulse
This project addresses the performance bottlenecks in attention mechanisms, particularly for large headdim values, by introducing FFPA (Faster Flash Prefill Attention). It offers a novel O(1) SRAM complexity solution, targeting researchers and engineers working with large transformer models who need to optimize inference and training speed. The primary benefit is a significant speedup, ranging from 1.8x to 3x, compared to standard implementations like SDPA EA, especially for demanding headdim configurations.
How It Works
FFPA extends FlashAttention-2 with a fine-grained tiling strategy at the MMA (Matrix Multiply-Accumulate) level, optimizing Q@K^T and P@V operations for large headdim (D > 256). This approach achieves O(1) SRAM complexity and O(d/4) or O(1) register complexity, enabling efficient processing beyond the typical limits of standard FlashAttention. FFPA is structured into three levels (L1-L3), each offering different trade-offs in register usage and recomputation while maintaining the same VRAM footprint. For smaller headdim, FFPA utilizes a coarse-grained tiling approach, delivering competitive performance.
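To illustrate the tiling idea behind this family of kernels (this is not FFPA's CUDA implementation), here is a minimal pure-Python sketch of flash-style online-softmax attention: K/V are processed in fixed-size blocks while a running max, normalizer, and output accumulator are maintained, so on-chip state stays constant regardless of sequence length. The function names and toy dimensions are illustrative.

```python
import math

def naive_attention(q, K, V):
    # Reference: full softmax(q @ K^T) @ V for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(V[0])
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z for j in range(d)]

def tiled_attention(q, K, V, block=2):
    # Flash-style streaming: process K/V in blocks, keeping only a running
    # max (m), running normalizer (z), and an unnormalized accumulator (acc).
    d = len(V[0])
    m = float("-inf")
    z = 0.0
    acc = [0.0] * d
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        # Rescale previously accumulated state to the new running max.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Because each block is rescaled on the fly, the streamed result matches the full-softmax reference exactly (up to floating-point error), which is what lets flash-style kernels keep SRAM usage independent of sequence length.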
Quick Start & Requirements
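As a sketch, the two install paths described in this section can be written as a short script (the .dev/install.sh and setup.py paths are taken from the project's instructions):

```shell
# Fetch the source.
git clone https://github.com/xlite-dev/ffpa-attn.git
cd ffpa-attn

# Option 1: run the provided install helper.
bash .dev/install.sh

# Option 2: build a wheel and install it with pip.
python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl
```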
Clone the repository (git clone https://github.com/xlite-dev/ffpa-attn.git) and run bash .dev/install.sh, or build and install the wheel package (python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl). flash-attention >= 2.6.3 is required for testing, and the nvcr.io/nvidia/pytorch:25.03-py3 container image is available as a prebuilt environment.
Highlighted Details
Speedups of 1.8x to 3x over SDPA EA are reported for headdim > 256 on various NVIDIA GPUs (L20, A30, 3080, 4090). FFPA supports headdim > 256, a range that standard FlashAttention-2 does not support.
Maintenance & Community
Information regarding specific maintainers, community channels (e.g., Discord, Slack), or active development beyond the core contributors is not detailed in the provided README. Contributions are welcomed via pull requests.
Licensing & Compatibility
The project is licensed under the GNU General Public License v3.0 (GPL-3.0). This strong copyleft license requires derivative works to also be licensed under GPL-3.0, which may impose restrictions on its use in closed-source commercial products or when linking with libraries under incompatible licenses.
Limitations & Caveats
The primary advantage of FFPA is for headdim values exceeding 256; while it offers competitive performance for smaller dimensions, its unique benefits are realized in larger configurations. The GPL-3.0 license is a significant consideration for adoption in proprietary software. The project requires recent versions of CUDA and PyTorch.