long-context-attention by feifeibear

Unified sequence parallel attention for long context LLM training/inference

Created 1 year ago
621 stars

Top 53.2% on SourcePulse

View on GitHub
Project Summary

This repository provides Unified Sequence Parallelism (USP), a novel attention mechanism designed to enable efficient training and inference of Large Language Models (LLMs) with long contexts. It addresses the limitations of existing methods such as DeepSpeed-Ulysses and Ring-Attention by combining their strengths, offering improved versatility and performance for researchers and engineers working with extended sequence lengths.

How It Works

USP combines DeepSpeed-Ulysses-Attention and Ring-Attention so that each covers the other's drawbacks: Ulysses's parallel degree is capped by the number of attention heads and it composes awkwardly with Tensor Parallelism, while Ring-Attention is less communication-efficient and prone to deadlocks. USP offers a unified approach, allowing flexible configuration (e.g., "zigzag" or "stripe" sequence layouts for load balancing) and supporting various hardware backends, including those without FlashAttention via a PyTorch implementation.
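
One way to picture the hybrid (a conceptual sketch only, not the library's internal code): the GPU ranks are factored into ulysses_degree × ring_degree, so each rank does head-wise all-to-all inside a small Ulysses group and block-wise P2P inside a larger Ring group. Whether Ulysses ranks are contiguous or strided in the real process-group construction depends on set_seq_parallel_pg; the grouping below is purely illustrative.

```python
# Conceptual sketch of USP's 2D process-group layout (illustration only, not library code).
# world_size ranks are factored into ulysses_degree x ring_degree; each rank belongs to
# one Ulysses group (all-to-all over attention heads) and one Ring group (P2P over
# sequence blocks).
world_size = 8
ulysses_degree = 2
ring_degree = world_size // ulysses_degree  # here: 4

ulysses_groups = [
    list(range(start, start + ulysses_degree))
    for start in range(0, world_size, ulysses_degree)
]
ring_groups = [
    list(range(offset, world_size, ulysses_degree))
    for offset in range(ulysses_degree)
]
print(ulysses_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(ring_groups)     # [[0, 2, 4, 6], [1, 3, 5, 7]]
```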

Quick Start & Requirements

  • Installation: pip install yunchang (requires flash-attn 2.6.x or 2.7.x for GPU acceleration); FlashAttention V3 must be installed from source. A PyTorch-based implementation (attn_type=AttnType.TORCH) is available for hardware without FlashAttention, though the backward pass is not supported.
  • Dependencies: FlashAttention (v2 or v3), PyTorch. AMD GPU support is available via install_amd.md.
  • Usage: Integrate by setting process groups with set_seq_parallel_pg and using LongContextAttention as a drop-in replacement for standard attention layers; a minimal sketch follows this list. Examples and testing scripts are provided in the test/ directory.
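
A minimal usage sketch under common assumptions (a standard torch.distributed NCCL setup, degrees that multiply to the world size); the exact signature of set_seq_parallel_pg, the AttnType import path, and the tensor layout expected by LongContextAttention are assumptions here and should be verified against the scripts in test/.

```python
import torch
import torch.distributed as dist
from yunchang import set_seq_parallel_pg, LongContextAttention
from yunchang.kernels import AttnType  # import path assumed; verify against the repo

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Factor the world into Ulysses and Ring sub-groups (product must equal world_size).
sp_ulysses_degree = 2
sp_ring_degree = world_size // sp_ulysses_degree
set_seq_parallel_pg(sp_ulysses_degree, sp_ring_degree, rank, world_size)

# Drop-in attention module; "zigzag" balances causal-attention load across the ring.
attn = LongContextAttention(ring_impl_type="zigzag", attn_type=AttnType.FA)

# Each rank holds its local shard of the sequence: (batch, local_seq_len, heads, head_dim).
q = torch.randn(1, 1024, 8, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn(q, k, v, causal=True)  # call signature assumed to mirror flash_attn_func
```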

Highlighted Details

  • Synergizes DeepSpeed-Ulysses and Ring-Attention for enhanced long-context LLM training.
  • Offers flexibility with different ring_impl_type ("zigzag", "stripe", "basic") and attn_type (FA, FA3, TORCH).
  • Verified for accuracy against Data Parallelism in Megatron-LM.
  • Benchmarks show competitive performance, with QKV-packed versions outperforming unpacked versions, especially at shorter sequence lengths.
  • Supports hybrid parallelism for heterogeneous network devices.

Maintenance & Community

The project has been integrated into several notable projects, including NVIDIA/TransformerEngine, xdit-project/xDiT, and NVlabs/VILA, indicating active adoption and validation.

Licensing & Compatibility

The repository is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch-based attention implementation (AttnType.TORCH) does not support the backward pass. The "zigzag" and "stripe" implementations have specific sequence dimension layout requirements.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai

3.5%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 15 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.1%
7k
Framework for training large-scale autoregressive language models
Created 5 years ago
Updated 1 month ago