gated_attention by qiuzh20

Gated attention for LLMs: Non-linearity, sparsity, and attention-sink-free

Created 8 months ago
773 stars

Top 45.3% on SourcePulse

View on GitHub
Project Summary

This repository provides the official implementation of "Gated Attention" for Large Language Models (LLMs), based on the Qwen3 architecture. It addresses critical LLM challenges such as training instability and the "attention sink" phenomenon, particularly in long-context scenarios. By introducing query-dependent sparse gating, the project offers improved performance, enhanced stability, and better generalization, making it valuable for researchers and practitioners seeking to optimize LLM efficiency and long-context capabilities.

How It Works

The core innovation lies in applying a query-dependent sparse gate immediately after the Scaled Dot-Product Attention (SDPA) output. This mechanism modulates attention heads independently (headwise) or elements of the attention output (elementwise) via a sigmoid function. This approach introduces crucial non-linearity, enables input-dependent sparsity to prevent early tokens from dominating attention distributions (the "attention sink"), significantly improves training stability by allowing higher learning rates, and enhances extrapolation capabilities for ultra-long contexts.
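
Below is a minimal PyTorch sketch of this gating idea, not the repository's actual code: the module name GatedSDPA, the gate_proj layer, and the headwise/elementwise switch are illustrative assumptions based on the description above.

```python
# Sketch of query-dependent sigmoid gating applied right after SDPA.
# Illustrative only; see the paper/repo for the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, headwise: bool = True):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.headwise = headwise
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Gate is computed from the same input as the query, hence "query-dependent":
        # one scalar per head (headwise) or one per output element (elementwise).
        self.gate_proj = nn.Linear(hidden_size, num_heads if headwise else hidden_size)

    def forward(self, x):  # x: (batch, seq, hidden)
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (b, h, s, d)
        gate = torch.sigmoid(self.gate_proj(x))
        if self.headwise:
            gate = gate.transpose(1, 2).unsqueeze(-1)  # (b, h, s, 1), broadcast over head_dim
        else:
            gate = gate.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        attn = attn * gate  # sparse gating applied immediately after the SDPA output
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
```

For example, GatedSDPA(hidden_size=64, num_heads=4)(torch.randn(2, 16, 64)) returns a gated attention output of shape (2, 16, 64).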

Quick Start & Requirements

  • Installation: pip install transformers matplotlib numpy torch
  • Demo: Run python demo.py to visualize attention maps.
  • Prerequisites: Requires transformers, matplotlib, numpy, and torch. Specific Python versions or hardware (e.g., GPU) are not explicitly mandated for the demo but are typical for PyTorch-based LLM work.
  • Resources: Links to models are available on Hugging Face, and the paper can be found at https://arxiv.org/abs/2505.06708 (a hedged loading sketch follows this list).
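
The released checkpoints should load with standard transformers APIs; the sketch below is an assumption based on that, and the model ID is a placeholder rather than a real checkpoint name (pick one from the Hugging Face links above). trust_remote_code may be needed if the gated-attention modules ship as custom code.

```python
# Hedged example of loading a released checkpoint; model_id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/<gated-attention-checkpoint>"  # placeholder, not a real ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Gated attention modulates the SDPA output with", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```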

Highlighted Details

  • Recognized with an Oral Presentation at NeurIPS 2025, placing in the top 1.5% of submissions.
  • Successfully integrated into the official Qwen3-Next architecture and deployed in the Qwen3-Next-80B-A3B-Instruct model.
  • Demonstrates significant performance gains on long-context benchmarks like RULER, supporting contexts up to 1 million tokens.
  • Effectively mitigates the "attention sink" phenomenon, leading to more distributed and context-aware attention patterns compared to standard transformer attention.

Maintenance & Community

  • Direct contact is available via email: qzh11628@gmail.com.
  • No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the provided README.

Licensing & Compatibility

  • The README does not specify a software license. This absence may complicate commercial use or integration into proprietary systems.

Limitations & Caveats

The documentation focuses on the core implementation and demonstration. Detailed guidance on large-scale training configurations, comprehensive benchmarks beyond the stated claims, and hardware requirements for advanced use cases is not provided. The lack of a clear license is a notable adoption blocker.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 214 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Luis Capelo (Cofounder of Lightning AI).

LongLM by datamllab

Top 0.1% · 665 stars
Self-Extend: LLM context window extension via self-attention
Created 2 years ago
Updated 1 year ago
Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

DeepSeek-V3.2-Exp by deepseek-ai

Top 1.0% · 1k stars
Experimental LLM boosting long-context efficiency
Created 3 months ago
Updated 1 month ago