Framework for efficient long-context LLM inference
DuoAttention addresses the significant memory and latency challenges of long-context LLM inference. It targets researchers and engineers working with large language models who need to process extended contexts efficiently without sacrificing accuracy. The framework offers substantial reductions in both pre-filling and decoding memory use and latency.
How It Works
DuoAttention identifies that only a subset of attention heads, termed "Retrieval Heads," are critical for long-context processing, while others ("Streaming Heads") focus on recent tokens. It applies a full KV cache only to Retrieval Heads and a lightweight, constant-length cache to Streaming Heads. This selective caching significantly reduces memory footprint and latency, with retrieval heads identified via an optimization-based algorithm trained on synthetic data.
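The caching policy can be sketched in a few lines of Python; the class and parameter names below (`DuoKVCache`, `sink`, `recent`) are illustrative assumptions, not the repository's API:

```python
# Minimal sketch (not the DuoAttention implementation): retrieval heads keep the
# full KV history, streaming heads keep only attention-sink tokens + a recent window.
import torch

class DuoKVCache:
    def __init__(self, is_retrieval_head, sink=64, recent=256):
        self.is_retrieval = list(is_retrieval_head)    # one flag per attention head
        self.sink, self.recent = sink, recent
        self.keys = [None] * len(self.is_retrieval)    # per-head key cache, (tokens, head_dim)
        self.values = [None] * len(self.is_retrieval)  # per-head value cache

    def append(self, head, k, v):
        """Append new (tokens, head_dim) keys/values for one head, then prune streaming heads."""
        self.keys[head] = k if self.keys[head] is None else torch.cat([self.keys[head], k])
        self.values[head] = v if self.values[head] is None else torch.cat([self.values[head], v])
        if not self.is_retrieval[head]:
            # Streaming head: constant-length cache = first `sink` + last `recent` tokens.
            K, V = self.keys[head], self.values[head]
            if K.shape[0] > self.sink + self.recent:
                self.keys[head] = torch.cat([K[: self.sink], K[-self.recent:]])
                self.values[head] = torch.cat([V[: self.sink], V[-self.recent:]])

    def cached_tokens(self, head):
        return 0 if self.keys[head] is None else self.keys[head].shape[0]
```

Retrieval-head caches grow linearly with context length, while streaming-head caches stay at `sink + recent` tokens, which is where the memory and latency savings come from.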
Quick Start & Requirements
Install with `pip install -e .` after setting up the environment. Pinned dependencies include `transformers==4.45.2`, `flash-attn==2.6.3`, and `flashinfer`. Setup also requires cloning `qserve` and installing its dependencies; a quick environment check is sketched below.
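Because the version pins are strict, verifying the installed versions up front can save a failed build. This snippet is an illustrative helper, not part of the repository:

```python
# Hedged sketch: confirm the pinned dependency versions listed above are installed.
from importlib.metadata import PackageNotFoundError, version

PINNED = {"transformers": "4.45.2", "flash-attn": "2.6.3"}  # pins from the README above

for package, expected in PINNED.items():
    try:
        found = version(package)
        status = "OK" if found == expected else f"mismatch (found {found}, expected {expected})"
    except PackageNotFoundError:
        status = "not installed"
    print(f"{package}: {status}")
```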
Highlighted Details
Maintenance & Community
The project originates from the MIT HAN Lab (mit-han-lab). No community channels (Discord/Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license. The code is provided for research purposes, and commercial use implications are not detailed.
Limitations & Caveats
The project is presented as a research artifact for ICLR 2025. While it demonstrates significant efficiency gains, the reported accuracy is tied to the chosen retrieval-head ratios (25% for MHA, 50% for GQA) on the evaluated benchmarks, so other tasks, models, or more aggressive ratios may involve trade-offs. Setup also requires careful dependency management, including specific CUDA and PyTorch versions.
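To make the ratio trade-off concrete, here is a rough, idealized estimate of per-head KV-cache savings; the context length and window sizes below are assumptions for illustration, not figures reported by the paper:

```python
# Back-of-the-envelope KV-cache estimate: with a fraction r of retrieval heads keeping
# the full cache and streaming heads keeping only sink + recent tokens, cached tokens
# per head scale roughly as r * L + (1 - r) * (sink + recent), versus L for full attention.
def kv_tokens_per_head(seq_len, retrieval_ratio, sink=64, recent=256):
    return retrieval_ratio * seq_len + (1 - retrieval_ratio) * (sink + recent)

L = 1_000_000  # assumed 1M-token context
for ratio, label in [(0.25, "MHA (25% retrieval heads)"), (0.5, "GQA (50% retrieval heads)")]:
    duo = kv_tokens_per_head(L, ratio)
    print(f"{label}: ~{L / duo:.1f}x fewer cached tokens per head at {L:,} tokens")
```

This idealized count ignores implementation overheads, so actual end-to-end memory and latency gains will be smaller and depend on the model and context length.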