MoDA by hustvl

Efficient attention mechanism for deep language models

Created 3 months ago

273 stars

Top 94.3% on SourcePulse

Project Summary

Mixture-of-Depths Attention (MoDA) addresses signal degradation in deep LLMs by letting attention heads access information from preceding layers. Each head attends both to the KV pairs of the current sequence and to KV pairs from earlier depths, which improves feature propagation. MoDA comes with a hardware-efficient implementation and is positioned as a primitive for scaling model depth without significant computational overhead. It is aimed at researchers and engineers building deep models.

How It Works

MoDA integrates a "depth stream" alongside standard sequence attention. Each head queries KV pairs from the current layer's sequence and from depth streams of preceding layers, mitigating feature dilution in deep models. The project emphasizes a hardware-efficient implementation that resolves non-contiguous memory access. A "Chunk/Group-aware MoDA" variant further optimizes depth KV calculation by reorganizing data based on chunk size and GQA groups, reducing memory access overhead.
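The idea of attending to sequence KV pairs and depth KV pairs together can be illustrated with a small pure-Python sketch for one head and one query token. It is a conceptual illustration only, not the project's Triton kernel. The function name moda_attend, its argument names, and the choice of a single softmax over the concatenated sequence and depth entries are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_attend(query, seq_keys, seq_values, depth_keys, depth_values):
    """Hypothetical single-head, single-query sketch.

    Assumption: sequence KV pairs and depth KV pairs share one softmax,
    so attention mass is split between the two blocks.
    Returns (output_vector, mass_assigned_to_depth_block).
    """
    keys = seq_keys + depth_keys
    values = seq_values + depth_values
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    depth_mass = sum(weights[len(seq_keys):])
    return output, depth_mass


# Toy inputs: two sequence tokens plus one depth entry from an earlier layer.
query = [1.0, 0.0]
seq_keys = [[1.0, 0.0], [0.0, 1.0]]
seq_values = [[1.0, 0.0], [0.0, 1.0]]
depth_keys = [[1.0, 0.0]]
depth_values = [[0.5, 0.5]]

output, depth_mass = moda_attend(query, seq_keys, seq_values, depth_keys, depth_values)
print(output, depth_mass)
```

Passing empty depth lists reduces the sketch to ordinary attention over the sequence, which makes the depth stream's contribution easy to isolate when experimenting.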

Quick Start & Requirements

Installation involves cloning the repo and installing the MoDA-enabled fla package locally:

    cd libs/moda_triton && pip install -e .

Dependencies include PyTorch (>= 2.5), Triton (>= 3.0), einops, transformers (>= 4.53.0), datasets (>= 3.3.0), and causal-conv1d (>= 1.4.0). Example commands for testing the Triton kernel and training vision tasks (DeiT on ImageNet) are provided.
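Before installing, it can help to check the local environment against the version floors listed above. The following is a minimal sketch using only the Python standard library. The helper names and the distribution names in MINIMUMS (for example "torch" for PyTorch) are assumptions for illustration, and the comparison looks only at the leading numeric part of each version string.

```python
import re
from importlib import metadata

# Minimum versions as listed in the summary; None means no floor is given.
# The distribution names are assumed, not taken from the project.
MINIMUMS = {
    "torch": "2.5",
    "triton": "3.0",
    "einops": None,
    "transformers": "4.53.0",
    "datasets": "3.3.0",
    "causal-conv1d": "1.4.0",
}


def release_tuple(version):
    """Leading numeric components only, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(p) for p in match.group(0).split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare release tuples after padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is None or meets_minimum(installed, minimum):
            report[name] = "ok (" + installed + ")"
        else:
            report[name] = "too old (" + installed + " < " + minimum + ")"
    return report


for package, status in check_requirements().items():
    print(package, "->", status)
```

Pre-release and local version suffixes are ignored by this comparison, so treat the report as a quick sanity check, not a full dependency resolver.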

Highlighted Details

Achieves 97.3% of FlashAttention-2 efficiency at 64K sequence length.
Improves LLMs: for 1.5B-parameter models, average perplexity drops by 0.2 and average downstream-task performance rises by 2.11%, with only 3.7% FLOPs overhead.
Consistently allocates attention mass to the Depth KV block across layers and heads.
Offers training recipes for vision tasks like ImageNet classification.

Maintenance & Community

Developed by researchers from Huazhong University of Science & Technology and ByteDance. Updates are shared via X/Twitter and blog articles. No explicit community channels or public roadmap are detailed.

Licensing & Compatibility

The provided README does not state an open-source license. Confirm the license terms before commercial use or integration into closed-source projects.

Limitations & Caveats

The project is under active development. The README lists a TODO to "Release full LLM training recipe and reproducible configs," so comprehensive LLM training configurations are still pending. Vision task recipes are already available.

Explore Similar Projects

native-sparse-attention-triton by XunhaoLai

Star-Attention by NVIDIA

FastV by pkunlp-icler

triton-flash-attention by hkproj

Quest by mit-han-lab

flash-sparse-attention by HKUSTDial

Block-Sparse-Attention by mit-han-lab

cuLA by inclusionAI

Kimi-Linear by MoonshotAI

MInference by microsoft

long-context-attention by feifeibear

flashinfer by flashinfer-ai