metal-flash-attention by philipturner

Metal port of FlashAttention for Apple silicon

created 2 years ago
511 stars

Top 62.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a Metal port of FlashAttention, optimized for Apple Silicon. It targets researchers and developers working with large language models on Apple hardware, offering a performant and memory-efficient implementation of the attention mechanism.

How It Works

The port focuses on single-headed attention, optimized closely for Metal and the Apple GPU architecture. It addresses register pressure bottlenecks through novel blocking strategies and intentional register spilling, achieving high ALU utilization. The backward pass is redesigned with a higher compute cost but better parallelization, avoiding the problematic emulation of FP32 atomics on Apple hardware.
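
For orientation, the following is a minimal CPU-only Swift sketch of the computation those kernels perform: standard single-headed, scaled-dot-product attention. It is a reference for the math only, not this library's API, and every name in it is illustrative.

```swift
import Foundation

// Reference sketch (not this library's API): single-headed,
// scaled-dot-product attention written as plain CPU code.
// Q, K, V have shape [sequenceLength][headDim]; the Metal kernels
// evaluate the same math block-by-block to control register pressure.
func referenceAttention(q: [[Float]], k: [[Float]], v: [[Float]]) -> [[Float]] {
    let rows = q.count
    let cols = k.count
    let headDim = q[0].count
    let scale = 1.0 / Float(headDim).squareRoot()
    var output = [[Float]](repeating: [Float](repeating: 0, count: headDim),
                           count: rows)

    for r in 0..<rows {
        // Scores for one query row: S[c] = (Q[r] . K[c]) / sqrt(D).
        var scores = [Float](repeating: 0, count: cols)
        for c in 0..<cols {
            var dot: Float = 0
            for d in 0..<headDim { dot += q[r][d] * k[c][d] }
            scores[c] = dot * scale
        }

        // Numerically stable softmax over the row.
        let maxScore = scores.max() ?? 0
        var weights = scores.map { Float(exp(Double($0 - maxScore))) }
        let sum = weights.reduce(0, +)
        weights = weights.map { $0 / sum }

        // O[r] = softmax(S) . V
        for c in 0..<cols {
            for d in 0..<headDim { output[r][d] += weights[c] * v[c][d] }
        }
    }
    return output
}
```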

Quick Start & Requirements

  • Install via Swift Package Manager (a manifest sketch follows this list) or by cloning the repository.
  • Requires macOS and Xcode.
  • Compile with -Xswiftc -Ounchecked for performance.
  • See Usage for detailed setup.
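
Below is a minimal sketch of adding the package through Swift Package Manager, as referenced in the list above. The repository URL is inferred from the project name; the "FlashAttention" product name and the "main" branch are assumptions, so consult the project's own Package.swift and Usage notes for the exact identifiers.

```swift
// swift-tools-version:5.9
// Minimal Package.swift sketch. Product name and branch are assumptions;
// check the repository's own Package.swift for the exact values.
import PackageDescription

let package = Package(
    name: "AttentionDemo",
    dependencies: [
        .package(
            url: "https://github.com/philipturner/metal-flash-attention",
            branch: "main"
        )
    ],
    targets: [
        .executableTarget(
            name: "AttentionDemo",
            dependencies: [
                // Product name assumed; adjust to match the package manifest.
                .product(name: "FlashAttention", package: "metal-flash-attention")
            ]
        )
    ]
)

// Build with the flag noted in the list above, e.g.:
//   swift build -c release -Xswiftc -Ounchecked
```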

Highlighted Details

  • Achieves 4400 gigainstructions per second on M1 Max.
  • Backward pass uses less memory than the official implementation.
  • Novel backward pass design with 100% parallelization efficiency.
  • Optimized register spilling for large head dimensions.

Maintenance & Community

  • Maintained by philipturner.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Currently supports only single-headed attention.
  • BF16 emulation is used, which may incur overhead on older chips.
  • Published benchmarks indicate that large head dimensions (D = 256) are problematic on some Nvidia hardware; this port's register-spilling strategy aims to handle that regime well on Apple Silicon.
Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 30 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (author of SWE-Gym; AI researcher at UC Berkeley), and 16 more.

flash-attention by Dao-AILab

  • Fast, memory-efficient attention implementation
  • Top 0.7% on sourcepulse · 19k stars
  • created 3 years ago · updated 18 hours ago