Research paper introducing MoBA for long-context LLMs
Top 23.9% on sourcepulse
MoBA (Mixture of Block Attention) addresses the quadratic complexity of the attention mechanism in Large Language Models (LLMs) for long-context processing. It targets researchers and developers building or fine-tuning LLMs, offering a flexible and efficient alternative to standard attention that can transition between full and sparse attention modes without compromising performance.
How It Works
MoBA applies Mixture of Experts (MoE) principles to the attention mechanism. It divides the full context into blocks and uses a parameter-less top-k gating mechanism for each query token to select the most relevant KV blocks. This allows the model to autonomously learn where to attend, avoiding predefined biases of other sparse attention methods. The approach is designed for seamless integration and continued training with existing models.
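As an illustration of this gating, here is a minimal single-head PyTorch sketch (the function name moba_sketch, the shapes, and the use of mean pooling as the block-level key summary are assumptions for illustration; details of the repository's implementation such as causal masking and multi-head batching are omitted):

import torch
import torch.nn.functional as F

def moba_sketch(q, k, v, block_size=4, top_k=2):
    # Illustrative block top-k gated attention (single head, no batching).
    # q, k, v: [seq_len, head_dim]; seq_len assumed divisible by block_size.
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size

    # Gate: score each query against a mean-pooled summary of every KV block.
    block_keys = k.view(n_blocks, block_size, dim).mean(dim=1)    # [n_blocks, dim]
    gate = q @ block_keys.T                                       # [seq_len, n_blocks]

    # Parameter-less top-k gating: keep only the highest-scoring blocks per query.
    topk_idx = gate.topk(top_k, dim=-1).indices                   # [seq_len, top_k]
    block_mask = torch.zeros(seq_len, n_blocks, dtype=torch.bool)
    block_mask.scatter_(1, topk_idx, True)

    # Expand the block mask to token level and run masked attention.
    token_mask = block_mask.repeat_interleave(block_size, dim=1)  # [seq_len, seq_len]
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 8)
sparse_out = moba_sketch(q, k, v, block_size=4, top_k=2)  # sparse mode
full_out = moba_sketch(q, k, v, block_size=4, top_k=4)    # top_k = n_blocks recovers full attention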
Quick Start & Requirements
conda create -n moba python=3.10
conda activate moba
pip install .

Requirements: flash-attn==2.6.3, torch >= 2.1.0

python3 examples/llama.py --model meta-llama/Llama-3.1-8B --attn moba
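An optional sanity check (not part of the repository's instructions) to confirm the pinned dependencies are importable before running the example; that flash_attn exposes __version__ is an assumption to verify against your installed build:

import torch
import flash_attn

print("torch:", torch.__version__)            # expected >= 2.1.0
print("flash-attn:", flash_attn.__version__)  # expected 2.6.3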
Highlighted Details
moba_efficient implementation with up to a 40x speedup over moba_naive (tested at 32K sequence length, 1 attention head, MoBA block size 2048, MoBA top-k 3).
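For context on those settings (an illustrative calculation, not the repository's benchmark code), the block size and top-k imply that each query attends to only a fraction of the keys:

seq_len, block_size, top_k = 32_768, 2048, 3

n_blocks = seq_len // block_size      # 16 blocks
keys_per_query = top_k * block_size   # 6,144 keys selected per query
density = keys_per_query / seq_len    # 0.1875

print(f"{n_blocks} blocks; each query attends to {keys_per_query} of "
      f"{seq_len} keys ({density:.1%} of full attention)")

Note that the 40x figure compares two implementations of the same MoBA computation (moba_efficient vs. moba_naive), not sparse versus full attention.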
Maintenance & Community
Last updated 4 months ago; the repository is listed as inactive.

Licensing & Compatibility

Limitations & Caveats
The moba_naive implementation is for understanding and visualization, not production use.