gated_attention by qiuzh20

Gated attention for LLMs: Non-linearity, sparsity, and attention-sink-free

Created 8 months ago
773 stars

Top 45.3% on SourcePulse

View on GitHub
Project Summary

This repository provides the official implementation of "Gated Attention" for Large Language Models (LLMs), based on the Qwen3 architecture. It addresses critical LLM challenges such as training instability and the "attention sink" phenomenon, particularly in long-context scenarios. By introducing query-dependent sparse gating, the project offers improved performance, enhanced stability, and better generalization, making it valuable for researchers and practitioners seeking to optimize LLM efficiency and long-context capabilities.

How It Works

The core innovation lies in applying a query-dependent sparse gate immediately after the Scaled Dot-Product Attention (SDPA) output. This mechanism modulates attention heads independently (headwise) or elements of the attention output (elementwise) via a sigmoid function. This approach introduces crucial non-linearity, enables input-dependent sparsity to prevent early tokens from dominating attention distributions (the "attention sink"), significantly improves training stability by allowing higher learning rates, and enhances extrapolation capabilities for ultra-long contexts.
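
Below is a minimal PyTorch sketch of this gating idea, not the repository's actual code: the module name GatedSDPA, the gate_proj layer, and the headwise/elementwise switch are illustrative assumptions based on the description above.

```python
# Sketch of query-dependent sigmoid gating applied right after SDPA.
# Illustrative only; see the paper/repo for the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, headwise: bool = True):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.headwise = headwise
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Gate is computed from the same input as the query, hence "query-dependent":
        # one scalar per head (headwise) or one per output element (elementwise).
        self.gate_proj = nn.Linear(hidden_size, num_heads if headwise else hidden_size)

    def forward(self, x):  # x: (batch, seq, hidden)
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (b, h, s, d)
        gate = torch.sigmoid(self.gate_proj(x))
        if self.headwise:
            gate = gate.transpose(1, 2).unsqueeze(-1)  # (b, h, s, 1), broadcast over head_dim
        else:
            gate = gate.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        attn = attn * gate  # sparse gating applied immediately after the SDPA output
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
```

For example, GatedSDPA(hidden_size=64, num_heads=4)(torch.randn(2, 16, 64)) returns a gated attention output of shape (2, 16, 64).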

Quick Start & Requirements

  • Installation: pip install transformers matplotlib numpy torch
  • Demo: Run python demo.py to visualize attention maps.
  • Prerequisites: Requires transformers, matplotlib, numpy, and torch. Specific Python versions or hardware (e.g., GPU) are not explicitly mandated for the demo but are typical for PyTorch-based LLM work.
  • Resources: Links to models are available on Hugging Face, and the paper can be found at https://arxiv.org/abs/2505.06708 (a hedged loading sketch follows this list).
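
The released checkpoints should load with standard transformers APIs; the sketch below is an assumption based on that, and the model ID is a placeholder rather than a real checkpoint name (pick one from the Hugging Face links above). trust_remote_code may be needed if the gated-attention modules ship as custom code.

```python
# Hedged example of loading a released checkpoint; model_id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/<gated-attention-checkpoint>"  # placeholder, not a real ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Gated attention modulates the SDPA output with", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```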

Highlighted Details

  • Recognized with an Oral Presentation at NeurIPS 2025, placing in the top 1.5% of submissions.
  • Successfully integrated into the official Qwen3-Next architecture and deployed in the Qwen3-Next-80B-A3B-Instruct model.
  • Demonstrates significant performance gains on long-context benchmarks like RULER, supporting contexts up to 1 million tokens.
  • Effectively mitigates the "attention sink" phenomenon, leading to more distributed and context-aware attention patterns compared to standard transformer attention.

Maintenance & Community

  • Direct contact is available via email: qzh11628@gmail.com.
  • No specific community channels (e.g., Discord, Slack) or public roadmaps are detailed in the provided README.

Licensing & Compatibility

  • The README does not specify a software license. This absence may complicate commercial use or integration into proprietary systems.

Limitations & Caveats

The documentation focuses on the core implementation and demonstration. Detailed guidance on large-scale training configurations, comprehensive benchmarks beyond the stated claims, and hardware requirements for advanced use cases is not provided. The lack of a clear license is a notable adoption blocker.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 4
  • Star History: 214 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Luis Capelo (Cofounder of Lightning AI).

LongLM by datamllab

Top 0.1% · 665 stars
Self-Extend: LLM context window extension via self-attention
Created 2 years ago
Updated 1 year ago
Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

DeepSeek-V3.2-Exp by deepseek-ai

Top 1.0% · 1k stars
Experimental LLM boosting long-context efficiency
Created 3 months ago
Updated 1 month ago