MagiAttention by SandAI-org

Distributed attention mechanism for ultra-long context, heterogeneous mask training

Created 5 months ago
533 stars

Top 59.6% on SourcePulse

View on GitHub
Project Summary

MagiAttention is a distributed attention mechanism designed for ultra-long context and heterogeneous mask training, targeting researchers and engineers working with large-scale models. It offers linear scalability with context-parallel size and flexible support for various attention mask types, aiming to improve training efficiency for tasks like video generation.

How It Works

MagiAttention employs a context-parallel (CP) strategy with several key innovations. It features a flexible Flash Attention kernel (FFA) that handles irregular attention masks at performance comparable to Flash-Attention 3 on Hopper GPUs. To keep computation balanced across ranks, it pairs a fine-grained sharding strategy with an efficient dispatch solver. Communication relies on two novel primitives, GroupCast and GroupReduce, built on All-to-All-v to minimize redundant transfers, and an adaptive multi-stage compute-communication overlap strategy hides the remaining communication latency.
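To make the load-balancing idea concrete, here is a minimal, self-contained Python sketch of zigzag-style chunk pairing under a causal mask, where later query positions attend to more keys and are therefore more expensive. The chunking scheme and every name in it are illustrative assumptions, not MagiAttention's actual dispatch solver, which handles far more general, irregular masks.

    # Toy illustration of balanced context-parallel sharding under a causal mask.
    # NOT MagiAttention's dispatch solver; just the general load-balancing idea.

    def causal_workload(chunk_start: int, chunk_len: int) -> int:
        """Causal-attention cost of a chunk, counted as the number of
        (query, key) pairs with key position <= query position."""
        return sum(pos + 1 for pos in range(chunk_start, chunk_start + chunk_len))

    def zigzag_dispatch(seq_len: int, cp_size: int):
        """Split the sequence into 2 * cp_size chunks and give rank r the r-th
        and the (2*cp_size - 1 - r)-th chunk, so cheap early chunks are paired
        with expensive late chunks and every rank gets a similar workload."""
        n_chunks = 2 * cp_size
        chunk_len = seq_len // n_chunks
        assignment = {rank: [rank * chunk_len, (n_chunks - 1 - rank) * chunk_len]
                      for rank in range(cp_size)}
        return assignment, chunk_len

    if __name__ == "__main__":
        seq_len, cp_size = 64, 4
        assignment, chunk_len = zigzag_dispatch(seq_len, cp_size)
        for rank, starts in assignment.items():
            load = sum(causal_workload(s, chunk_len) for s in starts)
            print(f"rank {rank}: chunk starts {starts}, causal workload {load}")

Running this prints the same workload for every rank, whereas a naive contiguous split would leave the last rank with several times the work of the first.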

Quick Start & Requirements

  • Installation: Launch an NGC PyTorch Docker container (e.g., nvcr.io/nvidia/pytorch:25.02-py3), install the requirements (pip install -r requirements.txt), then build MagiAttention from source (git clone ..., pip install --no-build-isolation .).
  • Prerequisites: Hopper GPUs are currently required; a quick device-capability check is sketched after this list.
  • Resources: Setup involves Docker, CUDA, and PyTorch. Specific resource requirements for training are not detailed but imply significant GPU memory and compute for ultra-long contexts.
  • Links: Blog, FSDP2 Integration Example
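Because Hopper GPUs are a hard requirement, a quick device check before building can save a failed install. The snippet below is a small sketch using standard PyTorch APIs only; it is not part of MagiAttention's setup, and the only assumption baked in is that Hopper corresponds to compute capability 9.0 (sm_90).

    # Sanity-check the local GPU before building MagiAttention from source.
    import torch

    if not torch.cuda.is_available():
        raise SystemExit("CUDA is not available in this environment.")

    major, minor = torch.cuda.get_device_capability()
    print(f"Detected {torch.cuda.get_device_name()} (compute capability {major}.{minor})")

    if (major, minor) < (9, 0):
        print("Warning: MagiAttention currently targets Hopper-class GPUs (sm_90).")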

Highlighted Details

  • Implements a flexible Flash Attention kernel (FFA) with performance comparable to Flash-Attention 3 on Hopper GPUs.
  • Achieves linear scalability with context-parallel size and sequence length, outperforming baselines like Ring-Attention and Ulysses.
  • Supports a wide range of attention mask types, including full, causal, inverse causal, bidirectional causal, and sliding-window masks; a few of these are illustrated right after this list.
  • Integrates with PyTorch's FSDP and aims for Megatron-LM compatibility.
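To pin down the mask terminology in the list above, the sketch below materializes a few of those masks as dense boolean tensors (True = may attend). This is purely illustrative: MagiAttention describes masks compactly per range rather than as dense tensors, and the bottom-right alignment chosen for the causal case is an assumption borrowed from common Flash-Attention conventions.

    # Dense boolean versions of a few mask types, for terminology only.
    import torch

    def causal_mask(q_len: int, k_len: int) -> torch.Tensor:
        # Query i attends to keys j with j <= i, aligned to the bottom-right corner.
        return torch.ones(q_len, k_len, dtype=torch.bool).tril(diagonal=k_len - q_len)

    def inv_causal_mask(q_len: int, k_len: int) -> torch.Tensor:
        # Mirror image of causal: query i attends to keys j with j >= i (top-left aligned).
        return torch.ones(q_len, k_len, dtype=torch.bool).triu(diagonal=0)

    def sliding_window_mask(q_len: int, k_len: int, window: int) -> torch.Tensor:
        # Causal attention restricted to the most recent `window` keys (self included).
        keep = causal_mask(q_len, k_len)
        too_old = torch.ones(q_len, k_len, dtype=torch.bool).tril(diagonal=k_len - q_len - window)
        return keep & ~too_old

    print(causal_mask(4, 4).int())
    print(sliding_window_mask(4, 4, window=2).int())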

Maintenance & Community

The project is actively developed by SandAI together with contributors from Nanjing University and Peking University. No community channels (Discord/Slack) are listed.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently, MagiAttention is restricted to Hopper GPUs, with plans to broaden support. Documentation for Megatron-LM integration and general API usage is marked as "Coming soon."

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 2
  • Star History: 27 stars in the last 30 days

Explore Similar Projects

Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

ring-attention-pytorch by lucidrains
  • Pytorch impl of Ring Attention for near-infinite context
  • 540 stars (0.2%)
  • Created 1 year ago; updated 5 months ago

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed
  • Framework for scaling multimodal model training across accelerators
  • 1k stars (1.5%)
  • Created 6 months ago; updated 4 hours ago

Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 27 more.

ColossalAI by hpcaitech
  • AI system for large-scale parallel training
  • 41k stars (0.0%)
  • Created 4 years ago; updated 1 day ago