ring-flash-attention by zhuzilin

FlashAttention extension for ring attention

Created 1 year ago · 869 stars · Top 41.4% on SourcePulse

Project Summary

This repository provides implementations of Ring Attention, a technique for scaling attention mechanisms across multiple GPUs, integrated with FlashAttention for improved efficiency. It targets researchers and engineers working with large language models who need to overcome memory and computational bottlenecks during training and inference, offering optimized attention kernels that reduce memory overhead and increase throughput.

How It Works

Ring Attention distributes attention computation across a ring of GPUs, allowing for longer sequence lengths than would be possible on a single device. It leverages FlashAttention's optimized kernels for efficient computation of the attention mechanism. The project offers several variants, including a basic ring attention, a compute-balanced "zigzag" version, and a "llama3" context parallelism approach that is less intrusive for existing training frameworks.
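
The key idea is that softmax attention can be computed block by block: each rank keeps a running output and log-sum-exp (LSE) for its query chunk and folds in the partial result from every key/value block that arrives over the ring. The following is a minimal, single-process sketch of that merge rule for illustration only (it is not the library's code); it simulates the ring by iterating over chunks and checks the result against full attention.

    import torch

    def merge(out, lse, block_out, block_lse):
        # Fold one block's partial attention into the running accumulator
        # using the log-sum-exp rescaling trick.
        new_lse = torch.logaddexp(lse, block_lse)
        out = (
            torch.exp(lse - new_lse).unsqueeze(-1) * out
            + torch.exp(block_lse - new_lse).unsqueeze(-1) * block_out
        )
        return out, new_lse

    def ring_attention_sim(q, k, v, world_size=4):
        # q, k, v: [batch, heads, seq, dim]. Each simulated rank owns one
        # query chunk; key/value chunks "rotate" past it one hop per step.
        q_chunks = q.chunk(world_size, dim=2)
        k_chunks = k.chunk(world_size, dim=2)
        v_chunks = v.chunk(world_size, dim=2)
        scale = q.shape[-1] ** -0.5
        outputs = []
        for rank in range(world_size):
            qi = q_chunks[rank]
            out = torch.zeros_like(qi)
            lse = torch.full(qi.shape[:-1], float("-inf"))
            for step in range(world_size):
                kj = k_chunks[(rank + step) % world_size]
                vj = v_chunks[(rank + step) % world_size]
                scores = (qi @ kj.transpose(-2, -1)) * scale    # [b, h, q_len, k_len]
                block_lse = torch.logsumexp(scores, dim=-1)     # [b, h, q_len]
                block_out = torch.softmax(scores, dim=-1) @ vj  # [b, h, q_len, dim]
                out, lse = merge(out, lse, block_out, block_lse)
            outputs.append(out)
        return torch.cat(outputs, dim=2)

    # Sanity check against full (non-causal) attention.
    q, k, v = (torch.randn(1, 2, 64, 16) for _ in range(3))
    full = torch.softmax((q @ k.transpose(-2, -1)) * 16 ** -0.5, dim=-1) @ v
    assert torch.allclose(ring_attention_sim(q, k, v), full, atol=1e-5)

In the actual library, the key/value exchange happens over peer-to-peer GPU communication that can be overlapped with the FlashAttention kernels, and the accuracy of this rescaling step is also why extra fp32 buffers are needed (see Limitations & Caveats).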

Quick Start & Requirements

  • Install: pip install ring-flash-attn or build from source.
  • Prerequisites: PyTorch, CUDA. NVLink between GPUs is recommended for high performance.
  • Testing: torchrun --nproc_per_node 8 test/test_llama3_flash_attn_varlen_func.py (example for 8 GPUs; a usage sketch follows this list).
  • Benchmarking: torchrun --nproc_per_node 8 benchmark/benchmark_kvpacked_func.py (example for 8 GPUs).
  • Documentation: https://github.com/zhuzilin/ring-flash-attention
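
For orientation, a hypothetical end-to-end usage sketch is shown below. It assumes the package exposes ring_flash_attn_qkvpacked_func with a flash_attn-style packed-QKV signature and that each rank holds only its shard of the sequence; the scripts under test/ are the authoritative reference for the actual function names and arguments.

    # Hypothetical usage sketch; launch with: torchrun --nproc_per_node 8 demo.py
    import torch
    import torch.distributed as dist
    from ring_flash_attn import ring_flash_attn_qkvpacked_func  # assumed API

    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)  # assumes one process per local GPU

    batch, total_seqlen, nheads, headdim = 1, 8192, 32, 128
    local_seqlen = total_seqlen // world_size  # each rank holds one sequence shard

    # Only the local shard of the packed (q, k, v) tensor lives on this GPU.
    qkv = torch.randn(
        batch, local_seqlen, 3, nheads, headdim,
        device="cuda", dtype=torch.bfloat16, requires_grad=True,
    )

    out = ring_flash_attn_qkvpacked_func(qkv, causal=True)  # [batch, local_seqlen, nheads, headdim]
    out.sum().backward()  # gradients also flow through the ring communication

    dist.destroy_process_group()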

Highlighted Details

  • Supports both batch and variable-length (varlen) sequence APIs.
  • Includes a Hugging Face model adapter for easier integration.
  • Benchmarks show Ring Attention achieving 50-90% of theoretical FlashAttention performance on 8x H800/A100 GPUs, depending on the variant and workload.
  • The llama3_flash_attn_varlen_func is recommended for varlen use cases because it is less intrusive to adopt and more numerically precise (the packed-sequence convention it builds on is sketched below).
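
For context on the varlen APIs: the packed-sequence convention inherited from flash_attn concatenates sequences of different lengths along the token dimension (no padding) and describes them with cumulative int32 offsets. A small illustrative sketch with made-up shapes:

    import torch

    seqlens = [5, 3, 7]                         # three sequences packed into one batch
    total_tokens, max_seqlen = sum(seqlens), max(seqlens)

    # cu_seqlens: cumulative offsets with a leading zero -> [0, 5, 8, 15]
    cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seqlens, dtype=torch.int32), dim=0)

    nheads, headdim = 8, 64
    # Packed q/k/v drop the batch dimension: [total_tokens, nheads, headdim].
    q = torch.randn(total_tokens, nheads, headdim)
    # cu_seqlens and max_seqlen are passed alongside q/k/v to the varlen
    # attention functions so that each sequence is attended to independently.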

Maintenance & Community

The project is actively developed by zhuzilin. There are no explicit mentions of community channels (e.g., Discord/Slack) or formal roadmaps in the README.

Licensing & Compatibility

The repository does not explicitly state a license, which makes it difficult to evaluate for commercial use or integration into closed-source projects.

Limitations & Caveats

The implementation shows small numerical differences relative to standard FlashAttention, attributed to the per-block outputs being computed in bf16; merging blocks accurately requires extra fp32 buffers, which increases memory usage. Dropout is not supported because of the difficulty of managing RNG states across the ring, and windowed (sliding-window) attention is not supported due to implementation complexity.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 28 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3% · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago