long-context-attention by feifeibear

Unified sequence parallel attention for long context LLM training/inference

Created 1 year ago
567 stars

Top 56.7% on SourcePulse

View on GitHub
Project Summary

This repository provides Unified Sequence Parallelism (USP), a novel attention mechanism designed to enable efficient training and inference of Large Language Models (LLMs) with long contexts. It addresses limitations of existing methods like DeepSpeed-Ulysses and Ring-Attention by synergizing their strengths, offering improved versatility and performance for researchers and engineers working with extended sequence lengths.

How It Works

USP combines DeepSpeed-Ulysses-Attention and Ring-Attention so that each compensates for the other's drawbacks: Ulysses' parallel degree is capped by the number of attention heads and it interacts awkwardly with Tensor Parallelism, while Ring-Attention is less compute-efficient and prone to communication deadlocks. USP exposes a unified interface with flexible configuration (e.g., "zigzag" or "stripe" ring variants for load balancing) and supports multiple hardware backends, including a pure-PyTorch implementation for devices without FlashAttention.
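
To make the two-level decomposition concrete, here is an illustrative sketch (not the library's internal code) of how a sequence-parallel group of size ulysses_degree × ring_degree maps onto a 2D grid: Ulysses' head-wise all-to-all runs along one axis, and the ring's sequence-wise point-to-point exchange runs along the other. The rank ordering shown here is an assumption.

```python
# Illustrative sketch of USP's two-level decomposition (not the library's
# internal code). A sequence-parallel group of ulysses_degree * ring_degree
# ranks is viewed as a 2D grid: Ulysses-style all-to-all over attention heads
# along one axis, Ring-Attention P2P over sequence chunks along the other.
# Whether the library orders ranks Ulysses-major or ring-major is an assumption.

def usp_grid_coords(rank: int, ulysses_degree: int, ring_degree: int) -> tuple[int, int]:
    """Map a flat sequence-parallel rank to (ulysses_rank, ring_rank)."""
    assert 0 <= rank < ulysses_degree * ring_degree
    ulysses_rank = rank % ulysses_degree   # peers for head-wise all-to-all
    ring_rank = rank // ulysses_degree     # peers for sequence-wise ring P2P
    return ulysses_rank, ring_rank

# Example: 8 GPUs split as ulysses_degree=4, ring_degree=2.
for r in range(8):
    print(r, usp_grid_coords(r, ulysses_degree=4, ring_degree=2))
```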

Quick Start & Requirements

  • Installation: pip install yunchang (requires flash-attn 2.6.x or 2.7.x for GPU acceleration); FlashAttention V3 must be installed from source. A PyTorch-based implementation (attn_type=AttnType.TORCH) is available for hardware without FlashAttention, though the backward pass is not supported.
  • Dependencies: FlashAttention (v2 or v3), PyTorch. AMD GPU support is available via install_amd.md.
  • Usage: Integrate by setting up the process groups with set_seq_parallel_pg and using LongContextAttention as a drop-in replacement for standard attention layers (see the usage sketch after this list). Examples and test scripts are provided in the test/ directory.
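
The minimal usage sketch below assumes a README-style API; the import paths, argument names, and tensor layouts are assumptions and should be checked against the scripts in test/.

```python
# Minimal usage sketch; import paths, argument names, and tensor layouts are
# assumptions based on the README pattern -- see test/ for authoritative examples.
import torch
import torch.distributed as dist
from yunchang import LongContextAttention, set_seq_parallel_pg
from yunchang.kernels import AttnType

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Split the sequence-parallel world into Ulysses and Ring sub-groups.
ulysses_degree = 2
ring_degree = world_size // ulysses_degree
set_seq_parallel_pg(ulysses_degree, ring_degree, rank, world_size)

# Drop-in replacement for a standard attention layer.
attn = LongContextAttention(ring_impl_type="zigzag", attn_type=AttnType.FA)

# Each rank holds its local shard of the sequence: (batch, local_seq, heads, head_dim).
q = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn(q, k, v, dropout_p=0.0, causal=True)
```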

Highlighted Details

  • Synergizes DeepSpeed-Ulysses and Ring-Attention for enhanced long-context LLM training.
  • Offers flexibility via ring_impl_type ("zigzag", "stripe", "basic") and attn_type (FA, FA3, TORCH); a backend-selection sketch follows this list.
  • Verified for accuracy against Data Parallelism in Megatron-LM.
  • Benchmarks show competitive performance, with QKV-packed versions outperforming unpacked versions, especially at shorter sequence lengths.
  • Supports hybrid parallelism tailored to devices with heterogeneous network interconnects.
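
Because the backend is selected through the attn_type constructor argument, a fallback can be chosen at runtime. The helper below is illustrative only and is not part of the library; the import path for AttnType is an assumption.

```python
# Illustrative backend selection (not part of the library): prefer FlashAttention
# when flash-attn is installed, otherwise fall back to the pure-PyTorch path,
# which is forward-only per the project's notes.
from yunchang import LongContextAttention
from yunchang.kernels import AttnType  # import path assumed from the README

def build_attention(ring_impl_type: str = "zigzag") -> LongContextAttention:
    try:
        import flash_attn  # noqa: F401  -- flash-attn 2.6.x / 2.7.x per the install notes
        attn_type = AttnType.FA
    except ImportError:
        attn_type = AttnType.TORCH  # runs without FlashAttention, no backward pass
    return LongContextAttention(ring_impl_type=ring_impl_type, attn_type=attn_type)
```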

Maintenance & Community

The project has been integrated into several notable downstream projects, including NVIDIA/TransformerEngine, xdit-project/xDiT, and NVlabs/VILA, indicating active adoption and external validation.

Licensing & Compatibility

The repository is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch-based attention implementation (attn_type=AttnType.TORCH) does not support the backward pass, so it is currently forward/inference-only. The "zigzag" and "stripe" ring implementations impose specific layout requirements on the sequence dimension of each rank's local shard (illustrated below).
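
As an illustration of the layout requirement: a zigzag scheme typically splits the sequence into 2R chunks for R ring ranks and gives rank i chunks i and 2R-1-i, which balances the causal-attention workload across ranks. The helper below is hypothetical and not the library's API.

```python
# Hypothetical helper (not the library's API) illustrating a zigzag layout:
# with R ring ranks the sequence is cut into 2R chunks and rank i keeps
# chunks i and 2R-1-i, balancing causal-attention work across ranks.
import torch

def zigzag_local_shard(x: torch.Tensor, ring_rank: int, ring_degree: int, seq_dim: int = 1):
    chunks = x.chunk(2 * ring_degree, dim=seq_dim)
    return torch.cat([chunks[ring_rank], chunks[2 * ring_degree - 1 - ring_rank]], dim=seq_dim)

# Example: a 16-token sequence on 4 ring ranks; rank 0 holds tokens [0, 1, 14, 15].
x = torch.arange(16).view(1, 16, 1)
print(zigzag_local_shard(x, ring_rank=0, ring_degree=4).flatten().tolist())
```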

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 1

Star History

19 stars in the last 30 days.

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

Explore Similar Projects

gpt-neox by EleutherAI

Framework for training large-scale autoregressive language models. Top 0.2% on SourcePulse, 7k stars. Created 4 years ago; updated 2 days ago.