Distributed attention mechanism and research paper for ultra-long-context, heterogeneous data training
MagiAttention is a distributed attention mechanism designed for ultra-long context and heterogeneous mask training, targeting researchers and engineers working with large-scale models. It offers linear scalability with context-parallel size and flexible support for various attention mask types, aiming to improve training efficiency for tasks like video generation.
How It Works
MagiAttention employs a context-parallel (CP) strategy with several key innovations. It features a flexible Flash Attention kernel (FFA) capable of handling irregular attention masks with performance comparable to Flash-Attention 3 on Hopper GPUs. To ensure balanced computation, it uses a fine-grained sharding strategy with an efficient dispatch solver. Communication is optimized through novel primitives, GroupCast and GroupReduce, built on All-to-All-v, minimizing redundant communication. An adaptive multi-stage compute-communication overlap strategy further hides communication latency.
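To make the load-balancing idea concrete, here is a minimal Python sketch of greedy chunk-to-rank assignment, not MagiAttention's actual dispatch solver: per-chunk attention costs are estimated from the mask, then chunks are assigned to the least-loaded context-parallel rank (a longest-processing-time-first heuristic). The function balanced_dispatch and the causal cost model are hypothetical illustrations.

```python
import heapq

def balanced_dispatch(chunk_costs: list[int], cp_size: int) -> list[list[int]]:
    """Greedily assign sequence chunks to context-parallel ranks so that
    per-rank attention work stays roughly equal (LPT heuristic)."""
    # Min-heap of (accumulated cost, rank); pop the least-loaded rank each step.
    heap = [(0, rank) for rank in range(cp_size)]
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(cp_size)]
    # Visit chunks in decreasing cost order so large chunks are placed first.
    for chunk_id in sorted(range(len(chunk_costs)), key=lambda i: -chunk_costs[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(chunk_id)
        heapq.heappush(heap, (load + chunk_costs[chunk_id], rank))
    return assignment

# Under a causal mask, chunk i attends to ~i+1 prior chunks, so costs grow
# linearly; naive contiguous sharding would overload the last rank.
costs = [i + 1 for i in range(16)]
print(balanced_dispatch(costs, cp_size=4))
```

Naive contiguous sharding of a causal mask leaves the last rank with far more work than the first; spreading expensive and cheap chunks across ranks, as above, is the same balancing goal MagiAttention's fine-grained sharding pursues.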
Quick Start & Requirements
Quick start: run inside the recommended NVIDIA PyTorch container (nvcr.io/nvidia/pytorch:25.02-py3), install the requirements (pip install -r requirements.txt), and then install MagiAttention from source (git clone ..., followed by pip install --no-build-isolation .).
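Before building from source, a quick environment check can save a failed compile. The snippet below is a generic sanity check, not part of the MagiAttention API; it assumes PyTorch is already installed and verifies that the visible GPU is Hopper-class (compute capability 9.x), matching the project's current hardware requirement.

```python
import torch

# Generic pre-install sanity check; MagiAttention currently targets Hopper GPUs.
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
# Hopper corresponds to compute capability 9.x (sm90).
assert major == 9, f"Hopper (sm90) GPU expected, found sm{major}{minor}"
print(f"OK: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
```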
Highlighted Details
Maintenance & Community
The project is actively developed by SandAI along with contributors from Nanjing University and Peking University. No community channels (Discord/Slack) are listed.
Licensing & Compatibility
Limitations & Caveats
Currently, MagiAttention is restricted to Hopper GPUs, with plans to broaden support. Documentation for Megatron-LM integration and general API usage is marked as "Coming soon."