long-context-attention by feifeibear

Unified sequence parallel attention for long context LLM training/inference

created 1 year ago
537 stars

Top 59.9% on sourcepulse

Project Summary

This repository provides Unified Sequence Parallelism (USP), a sequence-parallel attention approach designed for efficient training and inference of Large Language Models (LLMs) on long contexts. It addresses the limitations of DeepSpeed-Ulysses and Ring-Attention by combining their strengths, offering greater versatility and better performance for researchers and engineers working with extended sequence lengths.

How It Works

USP combines DeepSpeed-Ulysses-Attention and Ring-Attention to overcome their individual drawbacks: Ulysses' degree of parallelism is capped by the number of attention heads and it composes awkwardly with Tensor Parallelism, while Ring-Attention is less communication-efficient and prone to deadlocks. USP offers a unified approach, allowing flexible configuration (e.g., "zigzag" or "stripe" ring variants for load balancing) and supporting various hardware backends, including devices without FlashAttention via a pure-PyTorch implementation.
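To make the division of labor concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not code from the repository): the sequence-parallel group is factored into a Ulysses dimension, which shards attention heads via all-to-all, and a Ring dimension, which shards the sequence via peer-to-peer exchange, so the Ulysses degree no longer has to cover the whole group.

    # Conceptual sketch of USP's factorization (illustrative values, not library code).
    sp_degree = 8                     # total sequence-parallel GPUs
    ulysses_degree = 2                # bounded by the number of attention heads
    ring_degree = sp_degree // ulysses_degree
    assert ulysses_degree * ring_degree == sp_degree

    num_heads, seq_len = 32, 131072
    assert num_heads % ulysses_degree == 0          # Ulysses constraint: heads divide evenly
    heads_per_rank = num_heads // ulysses_degree    # head slice handled after all-to-all
    tokens_per_rank = seq_len // ring_degree        # sequence shard exchanged around the ring
    print(heads_per_rank, tokens_per_rank)          # 16 heads, 32768 tokens per rank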

Quick Start & Requirements

  • Installation: pip install yunchang (requires flash-attn 2.6.x or 2.7.x for GPU acceleration); FlashAttention V3 must be installed from source. A PyTorch-based implementation is available for hardware without FlashAttention (attn_type=AttnType.TORCH), though its backward pass is not supported.
  • Dependencies: FlashAttention (v2 or v3), PyTorch. AMD GPU support is available via install_amd.md.
  • Usage: Integrate by setting up the sequence-parallel process groups with set_seq_parallel_pg and using LongContextAttention as a drop-in replacement for standard attention layers; a usage sketch follows this list. Examples and testing scripts are provided in the test/ directory.
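A minimal usage sketch, assuming a torchrun launch: set_seq_parallel_pg, LongContextAttention, ring_impl_type, and attn_type come from the repository, while the exact argument order, the import path for AttnType, and the forward signature follow its test scripts and may differ between versions.

    import torch
    import torch.distributed as dist
    from yunchang import set_seq_parallel_pg, LongContextAttention
    from yunchang.kernels import AttnType   # import path assumed from the repo's examples

    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # world_size = ulysses_degree * ring_degree
    sp_ulysses_degree, sp_ring_degree = 2, world_size // 2
    set_seq_parallel_pg(sp_ulysses_degree, sp_ring_degree, rank, world_size)

    # Drop-in replacement for a standard attention layer.
    attn = LongContextAttention(ring_impl_type="zigzag", attn_type=AttnType.FA)

    # q, k, v hold this rank's local sequence shard: (batch, local_seq, heads, head_dim).
    b, local_seq, h, d = 1, 4096, 32, 128
    q, k, v = (torch.randn(b, local_seq, h, d, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    out = attn(q, k, v, causal=True)   # local output shard, same layout as q

On hardware without FlashAttention, passing attn_type=AttnType.TORCH selects the pure-PyTorch kernel (forward only).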

Highlighted Details

  • Synergizes DeepSpeed-Ulysses and Ring-Attention for enhanced long-context LLM training.
  • Offers flexibility with different ring_impl_type ("zigzag", "stripe", "basic") and attn_type (FA, FA3, TORCH).
  • Verified for accuracy against Data Parallelism in Megatron-LM.
  • Benchmarks show competitive performance, with QKV-packed versions outperforming unpacked versions, especially at shorter sequence lengths.
  • Supports hybrid parallelism tuned to heterogeneous network devices (see the configuration sketch below).
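For the network-aware hybrid setup, a common choice is to keep the all-to-all heavy Ulysses dimension inside a node (fast NVLink) and run the peer-to-peer Ring dimension across nodes. This is a hedged sketch of picking the degrees that way, not a recommendation from the repository; it reuses the set_seq_parallel_pg call shown above and assumes torchrun's LOCAL_WORLD_SIZE reflects GPUs per node.

    import os
    import torch
    import torch.distributed as dist
    from yunchang import set_seq_parallel_pg

    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Ulysses (all-to-all) stays intra-node; Ring (P2P) spans the slower inter-node links.
    gpus_per_node = int(os.environ.get("LOCAL_WORLD_SIZE", torch.cuda.device_count()))
    ulysses_degree = gpus_per_node
    ring_degree = world_size // gpus_per_node
    set_seq_parallel_pg(ulysses_degree, ring_degree, rank, world_size)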

Maintenance & Community

The project has been integrated into several notable projects, including NVIDIA/TransformerEngine, xdit-project/xDiT, and NVlabs/VILA, indicating active adoption and validation.

Licensing & Compatibility

The repository is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch-based attention implementation (AttnType.TORCH) does not support the backward pass. The "zigzag" and "stripe" ring variants require the input sequence to be pre-arranged in a specific per-rank chunk layout (see the sketch below).
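As an illustration of the layout constraint, here is a conceptual sketch of the typical zigzag sharding convention (my own illustration; the repository's helper functions may differ): with P ring ranks, the sequence is split into 2P chunks and rank i keeps chunks i and 2P-1-i, which balances causal-attention work across ranks.

    import torch

    def zigzag_shard(x: torch.Tensor, ring_degree: int, rank: int) -> torch.Tensor:
        """Return this rank's local shard under a zigzag layout along dim=1."""
        chunks = x.chunk(2 * ring_degree, dim=1)
        return torch.cat([chunks[rank], chunks[2 * ring_degree - 1 - rank]], dim=1)

    # Example: 4 ring ranks, sequence length 16 -> rank 0 holds chunks 0 and 7.
    x = torch.arange(16).view(1, 16, 1)
    print(zigzag_shard(x, ring_degree=4, rank=0).flatten().tolist())  # [0, 1, 14, 15]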

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

50 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack
0.4% · 258 stars · Efficiently train foundation models with PyTorch
created 1 year ago · updated 1 week ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
1.0% · 402 stars · Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch
0.9% · 4k stars · PyTorch platform for generative AI model training research
created 1 year ago · updated 1 day ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more.

flash-attention by Dao-AILab
0.7% · 19k stars · Fast, memory-efficient attention implementation
created 3 years ago · updated 20 hours ago