ring-flash-attention by zhuzilin

FlashAttention extension for ring attention

Created 1 year ago · 869 stars · Top 41.4% on SourcePulse

Project Summary

This repository provides implementations of Ring Attention, a technique for scaling attention mechanisms across multiple GPUs, integrated with FlashAttention for improved efficiency. It targets researchers and engineers working with large language models who need to overcome memory and computational bottlenecks during training and inference, offering optimized attention kernels that reduce memory overhead and increase throughput.

How It Works

Ring Attention distributes attention computation across a ring of GPUs, allowing for longer sequence lengths than would be possible on a single device. It leverages FlashAttention's optimized kernels for efficient computation of the attention mechanism. The project offers several variants, including a basic ring attention, a compute-balanced "zigzag" version, and a "llama3" context parallelism approach that is less intrusive for existing training frameworks.
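
The key idea is that softmax attention can be computed block by block: each rank keeps a running output and log-sum-exp (LSE) for its query chunk and folds in the partial result from every key/value block that arrives over the ring. The following is a minimal, single-process sketch of that merge rule for illustration only (it is not the library's code); it simulates the ring by iterating over chunks and checks the result against full attention.

    import torch

    def merge(out, lse, block_out, block_lse):
        # Fold one block's partial attention into the running accumulator
        # using the log-sum-exp rescaling trick.
        new_lse = torch.logaddexp(lse, block_lse)
        out = (
            torch.exp(lse - new_lse).unsqueeze(-1) * out
            + torch.exp(block_lse - new_lse).unsqueeze(-1) * block_out
        )
        return out, new_lse

    def ring_attention_sim(q, k, v, world_size=4):
        # q, k, v: [batch, heads, seq, dim]. Each simulated rank owns one
        # query chunk; key/value chunks "rotate" past it one hop per step.
        q_chunks = q.chunk(world_size, dim=2)
        k_chunks = k.chunk(world_size, dim=2)
        v_chunks = v.chunk(world_size, dim=2)
        scale = q.shape[-1] ** -0.5
        outputs = []
        for rank in range(world_size):
            qi = q_chunks[rank]
            out = torch.zeros_like(qi)
            lse = torch.full(qi.shape[:-1], float("-inf"))
            for step in range(world_size):
                kj = k_chunks[(rank + step) % world_size]
                vj = v_chunks[(rank + step) % world_size]
                scores = (qi @ kj.transpose(-2, -1)) * scale    # [b, h, q_len, k_len]
                block_lse = torch.logsumexp(scores, dim=-1)     # [b, h, q_len]
                block_out = torch.softmax(scores, dim=-1) @ vj  # [b, h, q_len, dim]
                out, lse = merge(out, lse, block_out, block_lse)
            outputs.append(out)
        return torch.cat(outputs, dim=2)

    # Sanity check against full (non-causal) attention.
    q, k, v = (torch.randn(1, 2, 64, 16) for _ in range(3))
    full = torch.softmax((q @ k.transpose(-2, -1)) * 16 ** -0.5, dim=-1) @ v
    assert torch.allclose(ring_attention_sim(q, k, v), full, atol=1e-5)

In the actual library, the key/value exchange happens over peer-to-peer GPU communication that can be overlapped with the FlashAttention kernels, and the accuracy of this rescaling step is also why extra fp32 buffers are needed (see Limitations & Caveats).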

Quick Start & Requirements

  • Install: pip install ring-flash-attn or build from source.
  • Prerequisites: PyTorch, CUDA. NVLink between GPUs is recommended for high performance.
  • Testing: torchrun --nproc_per_node 8 test/test_llama3_flash_attn_varlen_func.py (example for 8 GPUs; a usage sketch follows this list).
  • Benchmarking: torchrun --nproc_per_node 8 benchmark/benchmark_kvpacked_func.py (example for 8 GPUs).
  • Documentation: https://github.com/zhuzilin/ring-flash-attention
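
For orientation, a hypothetical end-to-end usage sketch is shown below. It assumes the package exposes ring_flash_attn_qkvpacked_func with a flash_attn-style packed-QKV signature and that each rank holds only its shard of the sequence; the scripts under test/ are the authoritative reference for the actual function names and arguments.

    # Hypothetical usage sketch; launch with: torchrun --nproc_per_node 8 demo.py
    import torch
    import torch.distributed as dist
    from ring_flash_attn import ring_flash_attn_qkvpacked_func  # assumed API

    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)  # assumes one process per local GPU

    batch, total_seqlen, nheads, headdim = 1, 8192, 32, 128
    local_seqlen = total_seqlen // world_size  # each rank holds one sequence shard

    # Only the local shard of the packed (q, k, v) tensor lives on this GPU.
    qkv = torch.randn(
        batch, local_seqlen, 3, nheads, headdim,
        device="cuda", dtype=torch.bfloat16, requires_grad=True,
    )

    out = ring_flash_attn_qkvpacked_func(qkv, causal=True)  # [batch, local_seqlen, nheads, headdim]
    out.sum().backward()  # gradients also flow through the ring communication

    dist.destroy_process_group()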

Highlighted Details

  • Supports both batch and variable-length (varlen) sequence APIs.
  • Includes a Hugging Face model adapter for easier integration.
  • Benchmarks show Ring Attention achieving 50-90% of theoretical FlashAttention performance on 8x H800/A100 GPUs, depending on the variant and workload.
  • The llama3_flash_attn_varlen_func is recommended for varlen use cases because it is less intrusive to adopt and more numerically precise (the packed-sequence convention it builds on is sketched below).
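
For context on the varlen APIs: the packed-sequence convention inherited from flash_attn concatenates sequences of different lengths along the token dimension (no padding) and describes them with cumulative int32 offsets. A small illustrative sketch with made-up shapes:

    import torch

    seqlens = [5, 3, 7]                         # three sequences packed into one batch
    total_tokens, max_seqlen = sum(seqlens), max(seqlens)

    # cu_seqlens: cumulative offsets with a leading zero -> [0, 5, 8, 15]
    cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seqlens, dtype=torch.int32), dim=0)

    nheads, headdim = 8, 64
    # Packed q/k/v drop the batch dimension: [total_tokens, nheads, headdim].
    q = torch.randn(total_tokens, nheads, headdim)
    # cu_seqlens and max_seqlen are passed alongside q/k/v to the varlen
    # attention functions so that each sequence is attended to independently.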

Maintenance & Community

The project is actively developed by zhuzilin. There are no explicit mentions of community channels (e.g., Discord/Slack) or formal roadmaps in the README.

Licensing & Compatibility

The repository does not explicitly state a license, which makes it difficult to evaluate for commercial use or integration into closed-source projects.

Limitations & Caveats

The implementation shows small numerical differences relative to standard FlashAttention, attributed to the per-block outputs being computed in bf16; merging blocks accurately requires extra fp32 buffers, which increases memory usage. Dropout is not supported because of the difficulty of managing RNG states across the ring, and windowed (sliding-window) attention is not supported due to implementation complexity.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 28 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

0.3% · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago