MagiAttention by SandAI-org

Distributed attention mechanism for ultra-long-context, heterogeneous-mask training

Created 3 months ago · 447 stars · Top 68.2% on sourcepulse

View on GitHub
Project Summary

MagiAttention is a distributed attention mechanism designed for ultra-long context and heterogeneous mask training, targeting researchers and engineers working with large-scale models. It offers linear scalability with context-parallel size and flexible support for various attention mask types, aiming to improve training efficiency for tasks like video generation.

How It Works

MagiAttention employs a context-parallel (CP) strategy with several key innovations. It features a flexible Flash Attention kernel (FFA) capable of handling irregular attention masks with performance comparable to Flash-Attention 3 on Hopper GPUs. To ensure balanced computation, it uses a fine-grained sharding strategy with an efficient dispatch solver. Communication is optimized through novel primitives, GroupCast and GroupReduce, built on All-to-All-v, minimizing redundant communication. An adaptive multi-stage compute-communication overlap strategy further hides communication latency.
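The GroupCast and GroupReduce primitives are internal to MagiAttention, but the core idea of a selective multicast expressed as a single all-to-all-v can be illustrated with standard torch.distributed calls. The sketch below is illustrative only, not MagiAttention's implementation: the function name group_cast, the 2-D KV-shard shape, and the dest_ranks argument are hypothetical choices for this example, and an initialized process group (e.g., NCCL) is assumed.

```python
# Illustrative sketch only -- not MagiAttention's GroupCast implementation.
# Idea: each rank sends its KV shard only to the ranks that need it, expressed
# as one all-to-all-v (zero-length slices for non-destinations), instead of a
# full broadcast/all-gather that would move redundant data.
import torch
import torch.distributed as dist


def group_cast(local_kv: torch.Tensor, dest_ranks: set[int]) -> torch.Tensor:
    """Send `local_kv` (rows x dim) to `dest_ranks`; return the concatenated
    shards that other ranks addressed to this rank."""
    world = dist.get_world_size()
    rows, dim = local_kv.shape

    # 1) Exchange per-peer row counts (one integer to/from every rank).
    send_counts = torch.tensor(
        [rows if r in dest_ranks else 0 for r in range(world)],
        dtype=torch.long, device=local_kv.device,
    )
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 2) One all-to-all-v moves the KV rows using the exchanged split sizes.
    send_buf = torch.cat(
        [local_kv if r in dest_ranks else local_kv[:0] for r in range(world)]
    )
    recv_buf = torch.empty(
        int(recv_counts.sum()), dim, dtype=local_kv.dtype, device=local_kv.device
    )
    dist.all_to_all_single(
        recv_buf, send_buf,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv_buf
```

A GroupReduce would be the mirror image: partial attention outputs travel back along the same routes and are summed on the owning rank, while MagiAttention additionally overlaps these transfers with FFA compute across multiple stages.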

Quick Start & Requirements

  • Installation: Run inside an NGC PyTorch Docker container (e.g., nvcr.io/nvidia/pytorch:25.02-py3), install the requirements (pip install -r requirements.txt), then install MagiAttention from source (git clone ..., pip install --no-build-isolation .).
  • Prerequisites: Hopper GPUs are currently required (a quick capability check is sketched after this list).
  • Resources: Setup involves Docker, CUDA, and PyTorch. Specific training resource requirements are not documented, but ultra-long-context training implies substantial GPU memory and compute.
  • Links: Blog, FSDP2 Integration Example
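Since Hopper is a hard requirement, it may help to confirm the local GPU before building from source. The check below is a minimal sketch using standard PyTorch calls; the only assumption beyond the listed prerequisites is that Hopper parts report CUDA compute capability 9.x.

```python
import torch


def is_hopper(device_index: int = 0) -> bool:
    """True if the CUDA device reports compute capability 9.x (Hopper, e.g. H100/H200)."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major == 9


if __name__ == "__main__":
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        print(f"{name}: {'OK for MagiAttention' if is_hopper() else 'not Hopper; currently unsupported'}")
    else:
        print("No CUDA device visible.")
```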

Highlighted Details

  • Implements a flexible Flash Attention kernel (FFA) with performance comparable to Flash-Attention 3 on Hopper GPUs.
  • Achieves linear scalability with context-parallel size and sequence length, outperforming baselines like Ring-Attention and Ulysses.
  • Supports a wide range of attention mask types, including full, causal, inverse causal, bidirectional causal, and sliding-window masks (the standard variants are sketched as dense masks after this list).
  • Integrates with PyTorch's FSDP and aims for Megatron-LM compatibility.
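For orientation on the mask types listed above, the sketch below builds the standard variants as dense boolean tensors. It is purely pedagogical: MagiAttention's FFA kernel consumes its own compact mask description rather than dense tensors, and the bidirectional-causal variant is omitted because its exact definition is not spelled out in this summary.

```python
import torch


def dense_masks(seq_len: int, window: int = 128) -> dict[str, torch.Tensor]:
    """Dense boolean masks; entry [i, j] is True if query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    return {
        "full": torch.ones(seq_len, seq_len, dtype=torch.bool),
        "causal": j <= i,                               # current and earlier keys
        "inverse_causal": j >= i,                       # current and later keys
        "sliding_window": (j <= i) & (i - j < window),  # causal window of `window` keys
    }


if __name__ == "__main__":
    for name, mask in dense_masks(6, window=3).items():
        print(name)
        print(mask.int())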

Maintenance & Community

The project is actively developed by SandAI together with contributors from Nanjing University and Peking University. No community channels (e.g., Discord or Slack) are listed.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently, MagiAttention is restricted to Hopper GPUs, with plans to broaden support. Documentation for Megatron-LM integration and general API usage is marked as "Coming soon."

Health Check

  • Last commit: 15 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 48
  • Issues (30d): 12
  • Star History: 158 stars in the last 90 days

Explore Similar Projects

InternEvo by InternLM
  • Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake)
  • 1.0% · 402 stars
  • Lightweight training framework for model pre-training
  • Created 1 year ago · updated 1 week ago

flash-attention by Dao-AILab
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more
  • 0.7% · 19k stars
  • Fast, memory-efficient attention implementation
  • Created 3 years ago · updated 20 hours ago