PyTorch implementation of Ring Attention for near-infinite context
This repository provides PyTorch implementations of Ring Attention and Striped Attention, which let transformers process significantly longer sequences (millions of tokens) by sharding the attention computation across multiple GPUs. It also includes Grouped Query Attention to further reduce communication costs, and targets researchers and engineers working with large-scale language models.
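As an illustrative back-of-envelope (the numbers and variable names below are assumptions, not taken from the repository), the key/value activations are what each ring member must send every step, so reducing the number of key/value heads via Grouped Query Attention shrinks the per-step payload proportionally:

```python
# Hypothetical sizing example: bytes of key/value activations one ring member
# sends per ring step, with and without grouped query attention (GQA).
seq_per_device = 4096   # tokens held by each ring member (assumed)
dim_head = 64
query_heads = 8
kv_heads = 2            # GQA: key/value heads shared across query heads
bytes_per_el = 2        # fp16 / bf16

mha_kv_bytes = 2 * seq_per_device * query_heads * dim_head * bytes_per_el
gqa_kv_bytes = 2 * seq_per_device * kv_heads * dim_head * bytes_per_el

print(f"per-step k/v payload  MHA: {mha_kv_bytes / 2**20:.1f} MiB  GQA: {gqa_kv_bytes / 2**20:.1f} MiB")
```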
How It Works
Ring Attention splits the sequence dimension across GPUs and passes key/value blocks around a ring of devices, so that communication overlaps with blockwise attention computation. Striped Attention builds on this by permuting the sequence to balance the workload across devices in causal (autoregressive) transformers. The implementation uses Flash Attention style tiled computation, with custom Triton kernels covering both the forward and backward passes.
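To make the ring pattern concrete, here is a minimal single-process sketch (not the repository's code; all names are illustrative): the sequence is split into blocks, one per simulated ring member, and key/value blocks are rotated around the ring while an online (log-sum-exp) softmax accumulates each query block's output, so no device ever materializes the full attention matrix.

```python
# Educational single-process sketch of the ring attention pattern (illustrative,
# not the library's implementation). Each "device" owns one query block;
# key/value blocks are rotated around a ring while a streaming softmax
# accumulates the result.

import torch
import torch.nn.functional as F

def ring_attention_reference(q, k, v, n_devices):
    # q, k, v: (batch, seq_len, dim); seq_len must divide evenly by n_devices
    b, n, d = q.shape
    scale = d ** -0.5

    q_blocks = q.chunk(n_devices, dim=1)
    k_blocks = list(k.chunk(n_devices, dim=1))
    v_blocks = list(v.chunk(n_devices, dim=1))

    outputs = []
    for i, qi in enumerate(q_blocks):
        # running statistics for the online softmax of this query block
        acc = torch.zeros_like(qi)                          # weighted value sum
        row_max = torch.full(qi.shape[:-1], float('-inf'))  # running max logit
        row_sum = torch.zeros(qi.shape[:-1])                # running softmax denominator

        # simulate the ring: each step "receives" the next key/value block
        for step in range(n_devices):
            j = (i + step) % n_devices
            kj, vj = k_blocks[j], v_blocks[j]

            logits = torch.einsum('b i d, b j d -> b i j', qi, kj) * scale

            new_max = torch.maximum(row_max, logits.amax(dim=-1))
            correction = torch.exp(row_max - new_max)
            p = torch.exp(logits - new_max[..., None])

            row_sum = row_sum * correction + p.sum(dim=-1)
            acc = acc * correction[..., None] + torch.einsum('b i j, b j d -> b i d', p, vj)
            row_max = new_max

        outputs.append(acc / row_sum[..., None])

    return torch.cat(outputs, dim=1)

# sanity check against full (non-causal) attention
q, k, v = (torch.randn(2, 64, 32) for _ in range(3))
out = ring_attention_reference(q, k, v, n_devices=4)
ref = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(out, ref, atol=1e-5)
```

In the actual multi-GPU setting, the inner loop's block hand-off becomes a peer-to-peer send/receive between neighboring ranks, which can overlap with the attention compute for the block already in hand.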
Quick Start & Requirements
Install with pip install ring-attention-pytorch. To run the bundled test scripts, first install the requirements (pip install -r requirements.txt), then run python assert.py or python assert_tree_attn.py.
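Assuming the package exposes a RingAttention module as in the upstream README (the constructor arguments below are an assumption and may differ between versions), basic usage looks roughly like this:

```python
import torch
from ring_attention_pytorch import RingAttention  # assumes this export exists

# argument names follow the upstream README and are an assumption;
# check the installed version before relying on them
attn = RingAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    causal = True,
    auto_shard_seq = True,      # shard the sequence across the ring automatically
    ring_attn = True,
    striped_ring_attn = True,   # enable striped attention for better load balance
    ring_seq_size = 512         # sequence length handled per ring member
)

tokens = torch.randn(1, 1024, 512)
attended = attn(tokens)         # same shape as the input

assert attended.shape == tokens.shape
```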
Maintenance & Community
The project is sponsored by the A16Z Open Source AI Grant Program. It acknowledges contributions from Tri Dao (Flash Attention) and Phil Tillet (Triton).
Licensing & Compatibility
The repository does not explicitly state a license in the README.
Limitations & Caveats
The "Todo" list indicates ongoing development, with several features and optimizations still pending, including distributed PyTorch testing and specific dataset sharding strategies for training. Some CUDA kernel optimizations are noted as "hacks" or requiring further validation.