Metal port of FlashAttention for Apple silicon
This repository provides a Metal port of FlashAttention, optimized for Apple Silicon. It targets researchers and developers working with large language models on Apple hardware, offering a performant and memory-efficient implementation of the attention mechanism.
How It Works
The port focuses on single-headed attention, optimized for the Metal architecture. It relieves register-pressure bottlenecks through novel blocking strategies and intentional register spilling, achieving high ALU utilization. The backward pass is redesigned to trade higher compute cost for better parallelization, avoiding the problematic emulation of FP32 atomics on Apple hardware.
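To make the blocking strategy concrete, here is a minimal NumPy sketch of the tiled forward pass with online softmax that FlashAttention is built on. It illustrates the algorithm only, with hypothetical function and parameter names; it is not the Metal kernel from this repository, and the block size, data layout, and register-spilling decisions the port makes are not reflected here.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=64):
    """Single-headed attention computed block by block.

    K and V are streamed in tiles so the full N x N score matrix is
    never materialized; running max (m) and denominator (l) implement
    the online softmax. Pure-NumPy sketch of the algorithm only.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                # scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)             # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]
```

The output matches naive attention exactly (up to floating-point error), while peak memory scales with the block size rather than the sequence length; the Metal port applies the same idea at the register and threadgroup level.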
Quick Start & Requirements
Pass -Xswiftc -Ounchecked to the Swift compiler for performance.

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats