MagiAttention by SandAI-org

Distributed attention mechanism for ultra-long-context, heterogeneous-mask training

Created 3 months ago · 447 stars · Top 68.2% on sourcepulse

View on GitHub
Project Summary

MagiAttention is a distributed attention mechanism designed for ultra-long context and heterogeneous mask training, targeting researchers and engineers working with large-scale models. It offers linear scalability with context-parallel size and flexible support for various attention mask types, aiming to improve training efficiency for tasks like video generation.

How It Works

MagiAttention employs a context-parallel (CP) strategy with several key innovations. It features a flexible Flash Attention kernel (FFA) capable of handling irregular attention masks with performance comparable to Flash-Attention 3 on Hopper GPUs. To ensure balanced computation, it uses a fine-grained sharding strategy with an efficient dispatch solver. Communication is optimized through novel primitives, GroupCast and GroupReduce, built on All-to-All-v, minimizing redundant communication. An adaptive multi-stage compute-communication overlap strategy further hides communication latency.
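The GroupCast and GroupReduce primitives are internal to MagiAttention, but the core idea of a selective multicast expressed as a single all-to-all-v can be illustrated with standard torch.distributed calls. The sketch below is illustrative only, not MagiAttention's implementation: the function name group_cast, the 2-D KV-shard shape, and the dest_ranks argument are hypothetical choices for this example, and an initialized process group (e.g., NCCL) is assumed.

```python
# Illustrative sketch only -- not MagiAttention's GroupCast implementation.
# Idea: each rank sends its KV shard only to the ranks that need it, expressed
# as one all-to-all-v (zero-length slices for non-destinations), instead of a
# full broadcast/all-gather that would move redundant data.
import torch
import torch.distributed as dist


def group_cast(local_kv: torch.Tensor, dest_ranks: set[int]) -> torch.Tensor:
    """Send `local_kv` (rows x dim) to `dest_ranks`; return the concatenated
    shards that other ranks addressed to this rank."""
    world = dist.get_world_size()
    rows, dim = local_kv.shape

    # 1) Exchange per-peer row counts (one integer to/from every rank).
    send_counts = torch.tensor(
        [rows if r in dest_ranks else 0 for r in range(world)],
        dtype=torch.long, device=local_kv.device,
    )
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 2) One all-to-all-v moves the KV rows using the exchanged split sizes.
    send_buf = torch.cat(
        [local_kv if r in dest_ranks else local_kv[:0] for r in range(world)]
    )
    recv_buf = torch.empty(
        int(recv_counts.sum()), dim, dtype=local_kv.dtype, device=local_kv.device
    )
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf
```

A GroupReduce would be the mirror image: partial attention outputs travel back along the same routes and are summed on the owning rank, while MagiAttention additionally overlaps these transfers with FFA compute across multiple stages.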

Quick Start & Requirements

  • Installation: Run inside an NGC PyTorch Docker container (e.g., nvcr.io/nvidia/pytorch:25.02-py3), install the requirements (pip install -r requirements.txt), then install MagiAttention from source (git clone ..., pip install --no-build-isolation .).
  • Prerequisites: Hopper GPUs are currently required (a quick capability check is sketched after this list).
  • Resources: Setup involves Docker, CUDA, and PyTorch. Specific training resource requirements are not documented, but ultra-long-context training implies substantial GPU memory and compute.
  • Links: Blog, FSDP2 Integration Example
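Since Hopper is a hard requirement, it may help to confirm the local GPU before building from source. The check below is a minimal sketch using standard PyTorch calls; the only assumption beyond the listed prerequisites is that Hopper parts report CUDA compute capability 9.x.

```python
import torch


def is_hopper(device_index: int = 0) -> bool:
    """True if the CUDA device reports compute capability 9.x (Hopper, e.g. H100/H200)."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major == 9


if __name__ == "__main__":
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        print(f"{name}: {'OK for MagiAttention' if is_hopper() else 'not Hopper; currently unsupported'}")
    else:
        print("No CUDA device visible.")
```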

Highlighted Details

  • Implements a flexible Flash Attention kernel (FFA) with performance comparable to Flash-Attention 3 on Hopper GPUs.
  • Achieves linear scalability with context-parallel size and sequence length, outperforming baselines like Ring-Attention and Ulysses.
  • Supports a wide range of attention mask types, including full, causal, inverse causal, bidirectional causal, and sliding-window masks (the standard variants are sketched as dense masks after this list).
  • Integrates with PyTorch's FSDP and aims for Megatron-LM compatibility.
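For orientation on the mask types listed above, the sketch below builds the standard variants as dense boolean tensors. It is purely pedagogical: MagiAttention's FFA kernel consumes its own compact mask description rather than dense tensors, and the bidirectional-causal variant is omitted because its exact definition is not spelled out in this summary.

```python
import torch


def dense_masks(seq_len: int, window: int = 128) -> dict[str, torch.Tensor]:
    """Dense boolean masks; entry [i, j] is True if query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    return {
        "full": torch.ones(seq_len, seq_len, dtype=torch.bool),
        "causal": j <= i,                               # current and earlier keys
        "inverse_causal": j >= i,                       # current and later keys
        "sliding_window": (j <= i) & (i - j < window),  # causal window of `window` keys
    }


if __name__ == "__main__":
    for name, mask in dense_masks(6, window=3).items():
        print(name)
        print(mask.int())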

Maintenance & Community

The project is actively developed by SandAI together with contributors from Nanjing University and Peking University. No community channels (e.g., Discord or Slack) are listed.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently, MagiAttention is restricted to Hopper GPUs, with plans to broaden support. Documentation for Megatron-LM integration and general API usage is marked as "Coming soon."

Health Check

  • Last commit: 15 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 48
  • Issues (30d): 12
  • Star History: 158 stars in the last 90 days

Explore Similar Projects

InternEvo by InternLM
  • Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake)
  • 1.0% · 402 stars
  • Lightweight training framework for model pre-training
  • Created 1 year ago · updated 1 week ago

flash-attention by Dao-AILab
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more
  • 0.7% · 19k stars
  • Fast, memory-efficient attention implementation
  • Created 3 years ago · updated 20 hours ago