qiuzh20 / Gated attention for LLMs: Non-linearity, sparsity, and attention-sink-free
Top 45.3% on SourcePulse
This repository provides the official implementation of "Gated Attention" for Large Language Models (LLMs), based on the Qwen3 architecture. It addresses critical LLM challenges such as training instability and the "attention sink" phenomenon, particularly in long-context scenarios. By introducing query-dependent sparse gating, the project offers improved performance, enhanced stability, and better generalization, making it valuable for researchers and practitioners seeking to optimize LLM efficiency and long-context capabilities.
How It Works
The core innovation is a query-dependent sparse gate applied immediately after the Scaled Dot-Product Attention (SDPA) output. The gate, computed via a sigmoid, modulates either whole attention heads (headwise) or individual elements of the attention output (elementwise). This introduces crucial non-linearity and input-dependent sparsity, which prevents early tokens from dominating attention distributions (the "attention sink"), improves training stability enough to permit higher learning rates, and strengthens extrapolation to ultra-long contexts.
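A minimal PyTorch sketch of this mechanism follows. The module structure, parameter names, and exact placement of the gate projection are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    """Sketch: SDPA followed by a query-dependent sigmoid gate.
    'headwise' yields one gate value per attention head; 'elementwise'
    gates every channel of the attention output individually."""

    def __init__(self, d_model: int, n_heads: int, mode: str = "elementwise"):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # The gate depends on the same input the query is computed from.
        self.gate_proj = nn.Linear(d_model, n_heads if mode == "headwise" else d_model)
        self.mode = mode

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):  # (B, T, D) -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        attn = F.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v), is_causal=True
        ).transpose(1, 2).reshape(B, T, D)

        # Query-dependent sparse gate, applied right after the SDPA output
        # and before the output projection.
        gate = torch.sigmoid(self.gate_proj(x))  # values in (0, 1)
        if self.mode == "headwise":
            # Broadcast each head's scalar gate over that head's channels.
            gate = gate.repeat_interleave(self.d_head, dim=-1)
        return self.out(attn * gate)
```

Because the sigmoid gate can saturate near zero for a given query, the effective attention output becomes sparse in an input-dependent way, which is the property the paper links to eliminating the attention sink.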
Quick Start & Requirements
- Install dependencies: pip install transformers matplotlib numpy torch
- Run python demo.py to visualize attention maps.
- Requires transformers, matplotlib, numpy, and torch. Specific Python versions or hardware (e.g., a GPU) are not explicitly mandated for the demo, but are typical for PyTorch-based LLM work.
- Paper: https://arxiv.org/abs/2505.06708
A rough sketch of this style of attention-map visualization follows.
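The repository's demo.py is the intended entry point; the snippet below only illustrates the general pattern of extracting and plotting attention maps with the transformers API. The model name ("gpt2") is a placeholder, not the repo's Qwen3-based gated-attention checkpoint:

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; the repo's demo uses its own checkpoint.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_attentions=True)

inputs = tok("Gated attention aims to remove the attention sink.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
attn = out.attentions[0][0, 0]  # first layer, first head
plt.imshow(attn.numpy(), cmap="viridis")
plt.xlabel("Key position")
plt.ylabel("Query position")
plt.title("Attention map (layer 0, head 0)")
plt.savefig("attention_map.png")
```

Highlighted Details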
Maintenance & Community
Contact: qzh11628@gmail.com. Last updated 3 weeks ago; the repository is currently marked inactive.
Licensing & Compatibility
No license is specified in the repository.
Limitations & Caveats
The documentation focuses on the core implementation and demonstration. Detailed guidance on large-scale training configurations, comprehensive performance benchmarks beyond the stated claims, and hardware requirements for advanced use cases is not provided. The lack of a clear license is a notable adoption blocker.