duo-attention by mit-han-lab

Framework for efficient long-context LLM inference

created 9 months ago
480 stars

Top 64.7% on sourcepulse

View on GitHub
Project Summary

DuoAttention addresses the significant memory and latency challenges of long-context LLM inference. It targets researchers and engineers working with large language models who need to process extended contexts efficiently without sacrificing accuracy. The framework delivers substantial reductions in memory use and latency during both pre-filling and decoding.

How It Works

DuoAttention observes that only a subset of attention heads, termed "Retrieval Heads," are critical for processing long contexts, while the remaining "Streaming Heads" attend mainly to recent tokens and attention sinks. It therefore keeps a full KV cache only for Retrieval Heads and a lightweight, constant-length cache (attention sinks plus a recent-token window) for Streaming Heads. This selective caching substantially reduces memory footprint and latency. Retrieval Heads are identified with an optimization-based algorithm trained on synthetic data.
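
A minimal PyTorch sketch of this dual caching policy (the head classification, sink/window sizes, and per-head cache layout below are illustrative assumptions, not the repository's implementation):

```python
import torch

def update_dual_kv_cache(k_cache, v_cache, k_new, v_new, is_retrieval_head,
                         sink_tokens=64, recent_window=256):
    """Append new tokens to each head's KV cache, then trim Streaming Heads.

    k_cache/v_cache: lists of per-head tensors, shape [cached_len, head_dim]
    k_new/v_new:     lists of per-head tensors, shape [new_len, head_dim]
    is_retrieval_head: one boolean per head.
    sink_tokens / recent_window are assumed hyperparameters for illustration.
    """
    for h in range(len(k_cache)):
        # Every head first appends the newly computed keys/values.
        k_cache[h] = torch.cat([k_cache[h], k_new[h]], dim=0)
        v_cache[h] = torch.cat([v_cache[h], v_new[h]], dim=0)

        if is_retrieval_head[h]:
            continue  # Retrieval Heads keep the full, growing cache.

        # Streaming Heads keep only the initial attention sinks plus a recent
        # window, so their cache size stays constant as the context grows.
        if k_cache[h].shape[0] > sink_tokens + recent_window:
            k_cache[h] = torch.cat([k_cache[h][:sink_tokens],
                                    k_cache[h][-recent_window:]], dim=0)
            v_cache[h] = torch.cat([v_cache[h][:sink_tokens],
                                    v_cache[h][-recent_window:]], dim=0)
    return k_cache, v_cache
```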

Quick Start & Requirements

  • Installation: pip install -e . after setting up the environment (a hedged usage sketch follows this list).
  • Prerequisites: Python 3.10, CUDA 12.4, PyTorch 2.4, transformers==4.45.2, flash-attn==2.6.3, flashinfer.
  • Setup: Requires creating a Conda environment, installing dependencies, and downloading datasets/models. The demo additionally requires cloning QServe and installing its dependencies.
  • Links: Paper, Slides, Demo
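
A rough usage sketch, recalled from the repository's README: the entry-point names (load_attn_pattern, sparsify_attention_heads, enable_duo_attention_eval), module paths, keyword arguments, and the attention-pattern path are assumptions and should be verified against the repo before use.

```python
# Assumed API, recalled from the DuoAttention README -- verify names, module
# paths, and signatures against the repository before relying on this sketch.
import torch
from transformers import AutoModelForCausalLM

from duo_attn.utils import load_attn_pattern, sparsify_attention_heads
from duo_attn.patch import enable_duo_attention_eval

model = AutoModelForCausalLM.from_pretrained(
    "gradientai/Llama-3-8B-Instruct-Gradient-1048k",  # illustrative model choice
    torch_dtype=torch.bfloat16,
)

# Load a pre-trained retrieval-head pattern shipped with the repo (path illustrative).
attn_heads, sink_size, recent_size = load_attn_pattern(
    "attn_patterns/Llama-3-8B-Instruct-Gradient-1048k"
)

# Keep a fraction of heads as Retrieval Heads; the rest become Streaming Heads.
attn_heads, sparsity = sparsify_attention_heads(attn_heads, sparsity=0.5)

# Patch the model so Streaming Heads use the constant-length sink + recent cache.
enable_duo_attention_eval(model, attn_heads, sink_size=64, recent_size=256)
```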

Highlighted Details

  • Achieves up to 2.55x memory reduction and 2.18x decoding speedup for MHA models (a back-of-envelope sketch of the KV budget follows this list).
  • Enables Llama-3-8B to handle a 3.3 million token context on a single A100 GPU when combined with quantization.
  • Provides accuracy comparable to full attention on Needle-in-a-Haystack benchmarks.
  • Offers better KV-budget/accuracy trade-offs on LongBench benchmarks.
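
As a rough, back-of-envelope illustration of where the savings come from (not a reproduction of the paper's measurements): if a fraction r of heads are Retrieval Heads with a full-length cache and the rest keep a constant-length cache of c tokens, the KV cache shrinks to roughly r + (1 - r) * c / L of its original size at context length L. The constant cache length and context length below are assumed values, and the reported end-to-end memory figures also include weights and activations, so they sit below this idealized bound.

```python
def kv_cache_fraction(retrieval_frac, const_cache_len, context_len):
    """Idealized fraction of the full-attention KV cache that remains when
    Retrieval Heads keep all `context_len` tokens and Streaming Heads keep
    only `const_cache_len` tokens (sinks + recent window)."""
    return retrieval_frac + (1 - retrieval_frac) * const_cache_len / context_len

# Assumed, illustrative numbers (not taken from the paper):
for frac, label in [(0.25, "MHA, 25% Retrieval Heads"),
                    (0.50, "GQA, 50% Retrieval Heads")]:
    kept = kv_cache_fraction(frac, const_cache_len=320, context_len=100_000)
    print(f"{label}: keeps ~{kept:.0%} of the KV cache (~{1 / kept:.1f}x smaller)")
```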

Maintenance & Community

The project originates from the MIT HAN Lab (mit-han-lab) at MIT. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes, and commercial use implications are not detailed.

Limitations & Caveats

The project is presented as a research artifact accompanying an ICLR 2025 paper. While it demonstrates significant efficiency gains, the reported accuracy is tied to fixed retrieval-head ratios (25% for MHA, 50% for GQA), so trade-offs may differ for other tasks or models. Setup also involves heavy dependency management, including specific CUDA, PyTorch, flash-attn, and transformers versions.

Health Check
  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
Star History
24 stars in the last 90 days
