MoDA by hustvl

Efficient attention mechanism for deep language models

Created 2 months ago
266 stars

Top 96.0% on SourcePulse

Project Summary

Mixture-of-Depths Attention (MoDA) addresses signal degradation in deep LLMs by letting each attention head attend to KV pairs from earlier depths as well as to the current layer's sequence KV pairs, which improves feature propagation across layers. Its hardware-efficient implementation makes it a promising primitive for scaling model depth without significant computational overhead. The project targets researchers and engineers developing deep learning models.

How It Works

MoDA integrates a "depth stream" alongside standard sequence attention. Each head queries KV pairs from the current layer's sequence and from depth streams of preceding layers, mitigating feature dilution in deep models. The project emphasizes a hardware-efficient implementation that resolves non-contiguous memory access. A "Chunk/Group-aware MoDA" variant further optimizes depth KV calculation by reorganizing data based on chunk size and GQA groups, reducing memory access overhead.
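The joint attention described above can be illustrated for a single query with plain Python. This is a minimal sketch, not the project's Triton kernel. The function name `moda_attend`, the list-based tensors, and the choice of one softmax shared by the sequence and depth KV blocks are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_attend(q, seq_k, seq_v, depth_k, depth_v):
    """Attend one query over sequence KV and depth KV jointly.

    seq_k / seq_v:     keys and values from the current layer's sequence.
    depth_k / depth_v: keys and values taken from preceding layers
                       (hypothetical layout; the real kernel may differ).
    Returns the attention output and the attention mass on the depth block.
    """
    # Assumption: both KV sources compete inside a single softmax.
    keys = seq_k + depth_k
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    depth_mass = sum(weights[len(seq_k):])
    return out, depth_mass


if __name__ == "__main__":
    q = [1.0, 0.5]
    seq_k = [[1.0, 0.0], [0.0, 1.0]]
    seq_v = [[1.0, 0.0], [0.0, 1.0]]
    depth_k = [[0.5, 0.5]]   # one KV pair from an earlier layer
    depth_v = [[2.0, 2.0]]
    out, depth_mass = moda_attend(q, seq_k, seq_v, depth_k, depth_v)
    print("output:", out)
    print("attention mass on depth KV:", depth_mass)
```

Returning the depth-block mass alongside the output makes it easy to inspect how much each head relies on earlier layers, under the assumptions stated above.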

Quick Start & Requirements

Installation involves cloning the repo and installing the MoDA-enabled fla package locally in editable mode:

    cd libs/moda_triton && pip install -e .

Dependencies include PyTorch (>= 2.5), Triton (>= 3.0), einops, transformers (>= 4.53.0), datasets (>= 3.3.0), and causal-conv1d (>= 1.4.0). The repository provides example commands for testing the Triton kernel and for training vision tasks (DeiT on ImageNet).
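Before installing, it can help to confirm that the environment meets the minimum versions listed above. The script below is a hedged sketch using only the standard library. The helper names (`parse`, `meets`, `check`) are hypothetical, the distribution names are assumed to match the usual PyPI names, and the version parser is deliberately simple (it ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as listed in the requirements; None means "any version".
MINIMUMS = {
    "torch": "2.5",
    "triton": "3.0",
    "einops": None,
    "transformers": "4.53.0",
    "datasets": "3.3.0",
    "causal-conv1d": "1.4.0",
}


def parse(ver):
    """Return the leading numeric components, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    parts = []
    for piece in ver.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets(installed, minimum):
    """True if the installed version is at least the minimum."""
    if minimum is None:
        return True
    a, b = parse(installed), parse(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(minimums=MINIMUMS):
    """Map each distribution name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check().items():
        print(f"{name}: {status}")
```

Running the script prints one status line per dependency, so missing or outdated packages can be fixed before the editable install.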

Highlighted Details

  • Achieves 97.3% of FlashAttention-2 efficiency at 64K sequence length.
  • For 1.5B-parameter LLMs, lowers average perplexity by 0.2 and raises average downstream-task performance by 2.11%, with only 3.7% FLOPs overhead.
  • Consistently allocates attention mass to the Depth KV block across layers and heads.
  • Offers training recipes for vision tasks like ImageNet classification.

Maintenance & Community

Developed by researchers from Huazhong University of Science & Technology and ByteDance. Updates are shared via X/Twitter and blog articles. No explicit community channels or public roadmap are detailed.

Licensing & Compatibility

The README does not state an open-source license. Check the repository's licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

The project is under active development. Its TODO list includes "Release full LLM training recipe and reproducible configs," so LLM training configurations are not yet available; vision task recipes are.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 15 stars in the last 30 days
