long-context-attention by feifeibear

Unified sequence parallel attention for long context LLM training/inference

Created 1 year ago
621 stars

Top 53.2% on SourcePulse

View on GitHub
Project Summary

This repository provides Unified Sequence Parallelism (USP), a novel attention mechanism designed to enable efficient training and inference of Large Language Models (LLMs) with long contexts. It addresses the limitations of existing methods such as DeepSpeed-Ulysses and Ring-Attention by combining their strengths, offering improved versatility and performance for researchers and engineers working with extended sequence lengths.

How It Works

USP combines DeepSpeed-Ulysses-Attention and Ring-Attention so that each covers the other's drawbacks: Ulysses's parallel degree is capped by the number of attention heads and it composes awkwardly with Tensor Parallelism, while Ring-Attention is less communication-efficient and prone to deadlocks. USP offers a unified approach, allowing flexible configuration (e.g., "zigzag" or "stripe" sequence layouts for load balancing) and supporting various hardware backends, including those without FlashAttention via a PyTorch implementation.
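
One way to picture the hybrid (a conceptual sketch only, not the library's internal code): the GPU ranks are factored into ulysses_degree × ring_degree, so each rank does head-wise all-to-all inside a small Ulysses group and block-wise P2P inside a larger Ring group. Whether Ulysses ranks are contiguous or strided in the real process-group construction depends on set_seq_parallel_pg; the grouping below is purely illustrative.

```python
# Conceptual sketch of USP's 2D process-group layout (illustration only, not library code).
# world_size ranks are factored into ulysses_degree x ring_degree; each rank belongs to
# one Ulysses group (all-to-all over attention heads) and one Ring group (P2P over
# sequence blocks).
world_size = 8
ulysses_degree = 2
ring_degree = world_size // ulysses_degree  # here: 4

ulysses_groups = [
    list(range(start, start + ulysses_degree))
    for start in range(0, world_size, ulysses_degree)
]
ring_groups = [
    list(range(offset, world_size, ulysses_degree))
    for offset in range(ulysses_degree)
]
print(ulysses_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(ring_groups)     # [[0, 2, 4, 6], [1, 3, 5, 7]]
```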

Quick Start & Requirements

  • Installation: pip install yunchang (requires flash-attn 2.6.x or 2.7.x for GPU acceleration); FlashAttention V3 must be installed from source. A PyTorch-based implementation (attn_type=AttnType.TORCH) is available for hardware without FlashAttention, though the backward pass is not supported.
  • Dependencies: FlashAttention (v2 or v3), PyTorch. AMD GPU support is available via install_amd.md.
  • Usage: Integrate by setting process groups with set_seq_parallel_pg and using LongContextAttention as a drop-in replacement for standard attention layers; a minimal sketch follows this list. Examples and testing scripts are provided in the test/ directory.
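
A minimal usage sketch under common assumptions (a standard torch.distributed NCCL setup, degrees that multiply to the world size); the exact signature of set_seq_parallel_pg, the AttnType import path, and the tensor layout expected by LongContextAttention are assumptions here and should be verified against the scripts in test/.

```python
import torch
import torch.distributed as dist
from yunchang import set_seq_parallel_pg, LongContextAttention
from yunchang.kernels import AttnType  # import path assumed; verify against the repo

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Factor the world into Ulysses and Ring sub-groups (product must equal world_size).
sp_ulysses_degree = 2
sp_ring_degree = world_size // sp_ulysses_degree
set_seq_parallel_pg(sp_ulysses_degree, sp_ring_degree, rank, world_size)

# Drop-in attention module; "zigzag" balances causal-attention load across the ring.
attn = LongContextAttention(ring_impl_type="zigzag", attn_type=AttnType.FA)

# Each rank holds its local shard of the sequence: (batch, local_seq_len, heads, head_dim).
q = torch.randn(1, 1024, 8, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn(q, k, v, causal=True)  # call signature assumed to mirror flash_attn_func
```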

Highlighted Details

  • Synergizes DeepSpeed-Ulysses and Ring-Attention for enhanced long-context LLM training.
  • Offers flexibility with different ring_impl_type ("zigzag", "stripe", "basic") and attn_type (FA, FA3, TORCH).
  • Verified for accuracy against Data Parallelism in Megatron-LM.
  • Benchmarks show competitive performance, with QKV-packed versions outperforming unpacked versions, especially at shorter sequence lengths.
  • Supports hybrid parallelism for heterogeneous network devices.

Maintenance & Community

The project has been integrated into several notable projects, including NVIDIA/TransformerEngine, xdit-project/xDiT, and NVlabs/VILA, indicating active adoption and validation.

Licensing & Compatibility

The repository is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch-based attention implementation (AttnType.TORCH) does not support the backward pass. The "zigzag" and "stripe" implementations have specific sequence dimension layout requirements.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai

3.5%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 15 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.1%
7k
Framework for training large-scale autoregressive language models
Created 5 years ago
Updated 1 month ago