ring-flash-attention by zhuzilin

FlashAttention extension for ring attention

created 1 year ago
827 stars

Top 43.8% on sourcepulse

View on GitHub
Project Summary

This repository provides implementations of Ring Attention, a technique for scaling attention across multiple GPUs, integrated with FlashAttention for improved efficiency. It targets researchers and engineers working with large language models who hit memory and compute bottlenecks during training and inference, offering optimized attention kernels that reduce per-GPU memory overhead and increase throughput.

How It Works

Ring Attention distributes attention computation across a ring of GPUs, allowing sequence lengths far longer than a single device could handle, and leverages FlashAttention's optimized kernels for the per-block attention computation. The project offers several variants: basic ring attention, a compute-balanced "zigzag" version, and a "llama3" context-parallelism approach that is less intrusive to integrate into existing training frameworks.
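
To make the mechanics concrete, the sketch below simulates the basic (non-causal) ring variant's arithmetic in a single process with plain PyTorch: each K/V "rotation step" corresponds to receiving a peer's shard, and partial outputs are merged with an online-softmax (log-sum-exp) rescaling. The real library overlaps this with peer-to-peer communication and runs FlashAttention kernels per block; the function names here are illustrative, not the library's API.

    import torch

    def attend_block(q, k, v, scale):
        # Attention of the local query shard against one K/V shard: the output
        # is normalized within the shard; the log-sum-exp is kept for merging.
        scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
        lse = torch.logsumexp(scores, dim=-1)                    # (b, h, q_len)
        out = torch.einsum("bhqk,bhkd->bhqd", scores.softmax(-1), v)
        return out, lse

    def ring_attention_sim(q, k_shards, v_shards, scale):
        # q is this rank's query shard; k_shards/v_shards stand in for the
        # shards that would arrive from peers, one per ring rotation step.
        out, lse = None, None
        for k, v in zip(k_shards, v_shards):
            block_out, block_lse = attend_block(q, k, v, scale)
            if out is None:
                out, lse = block_out, block_lse
            else:
                # Online-softmax merge: rescale both partial outputs onto the
                # combined log-sum-exp before adding them.
                new_lse = torch.logaddexp(lse, block_lse)
                out = (out * (lse - new_lse).exp().unsqueeze(-1)
                       + block_out * (block_lse - new_lse).exp().unsqueeze(-1))
                lse = new_lse
        return out

    b, h, d, world_size, shard_len = 2, 4, 64, 8, 128
    q = torch.randn(b, h, shard_len, d)
    k_shards = [torch.randn(b, h, shard_len, d) for _ in range(world_size)]
    v_shards = [torch.randn(b, h, shard_len, d) for _ in range(world_size)]
    ring_out = ring_attention_sim(q, k_shards, v_shards, d ** -0.5)

    # Sanity check against ordinary attention over the concatenated K/V.
    ref, _ = attend_block(q, torch.cat(k_shards, 2), torch.cat(v_shards, 2), d ** -0.5)
    print(torch.allclose(ring_out, ref, atol=1e-5))

In broad terms, the "zigzag" variant changes only which sequence chunks each rank owns, pairing chunks from the two ends of the sequence so that causal masking leaves every rank with roughly equal work per rotation.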

Quick Start & Requirements

  • Install: pip install ring-flash-attn or build from source.
  • Prerequisites: PyTorch with CUDA and the flash-attn package; NVLink between GPUs is recommended for high performance.
  • Testing: torchrun --nproc_per_node 8 test/test_llama3_flash_attn_varlen_func.py (example for 8 GPUs; a minimal launch sketch follows this list).
  • Benchmarking: torchrun --nproc_per_node 8 benchmark/benchmark_kvpacked_func.py (example for 8 GPUs).
  • Documentation: https://github.com/zhuzilin/ring-flash-attention
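
A minimal multi-GPU launch sketch (run with torchrun, e.g. torchrun --nproc_per_node 8 demo.py). It assumes the package exposes ring_flash_attn_qkvpacked_func with a flash-attn-style qkv-packed signature and a causal keyword; consult the repo's test/ and benchmark/ scripts for the exact APIs and arguments of your installed version.

    import torch
    import torch.distributed as dist
    from ring_flash_attn import ring_flash_attn_qkvpacked_func

    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    batch, total_seqlen, nheads, headdim = 1, 8192, 8, 64
    local_seqlen = total_seqlen // world_size   # each rank holds one sequence shard

    # FlashAttention kernels expect fp16/bf16 inputs; the qkv-packed layout is
    # (batch, local_seqlen, 3, nheads, headdim).
    qkv = torch.randn(batch, local_seqlen, 3, nheads, headdim,
                      device="cuda", dtype=torch.bfloat16, requires_grad=True)

    out = ring_flash_attn_qkvpacked_func(qkv, causal=True)   # output for the local shard
    out.sum().backward()                                     # gradients flow through the ring
    dist.destroy_process_group()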

Highlighted Details

  • Supports both batch and variable-length (varlen) sequence APIs.
  • Includes a Hugging Face model adapter for easier integration.
  • Benchmarks show Ring Attention achieving 50-90% of theoretical FlashAttention performance on 8x H800/A100 GPUs, depending on the variant and workload.
  • The llama3_flash_attn_varlen_func is recommended for varlen use cases because it is less intrusive to integrate and offers better numerical precision (the sketch after this list shows the packed-sequence bookkeeping it builds on).
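
For the varlen APIs, sequences of different lengths are packed into one token dimension and described by a cumulative-length tensor, as in FlashAttention's own varlen interface. The sketch below shows only this bookkeeping; the llama3 variant's call takes additional context-parallel arguments not reproduced here, so refer to test/test_llama3_flash_attn_varlen_func.py for a complete invocation.

    import torch
    import torch.nn.functional as F

    seqlens = [1000, 3500, 2100]                  # token count of each packed sequence
    lengths = torch.tensor(seqlens, dtype=torch.int32)
    # cu_seqlens marks each sequence's start/end offset in the packed tensor.
    cu_seqlens = F.pad(torch.cumsum(lengths, 0), (1, 0)).to(torch.int32)  # [0, 1000, 4500, 6600]
    max_seqlen = int(lengths.max())

    nheads, headdim = 8, 64
    total_tokens = int(lengths.sum())
    # Varlen kernels consume packed (total_tokens, nheads, headdim) tensors.
    q = torch.randn(total_tokens, nheads, headdim, dtype=torch.bfloat16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    print(cu_seqlens.tolist(), max_seqlen, q.shape)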

Maintenance & Community

The project is actively developed by zhuzilin. There are no explicit mentions of community channels (e.g., Discord/Slack) or formal roadmaps in the README.

Licensing & Compatibility

The repository does not explicitly state a license, which complicates evaluation for commercial use or integration into closed-source projects.

Limitations & Caveats

The implementation has known numerical deviations from single-GPU FlashAttention, likely because each block's result is returned in bf16; accumulating these accurately requires extra fp32 buffers, which increases memory usage. Dropout is not supported because managing the per-block RNG states across the ring is difficult, and windowed (sliding-window) attention is not supported due to implementation complexity.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 78 stars in the last 90 days
