duo-attention by mit-han-lab

Framework for efficient long-context LLM inference

created 9 months ago
480 stars

Top 64.7% on sourcepulse

View on GitHub
Project Summary

DuoAttention addresses the significant memory and latency challenges of long-context LLM inference. It targets researchers and engineers working with large language models who need to process extended contexts efficiently without sacrificing accuracy. The framework delivers substantial reductions in memory use and latency during both pre-filling and decoding.

How It Works

DuoAttention observes that only a subset of attention heads, termed "Retrieval Heads," are critical for processing long contexts, while the remaining "Streaming Heads" attend mainly to recent tokens and attention sinks. It therefore keeps a full KV cache only for Retrieval Heads and a lightweight, constant-length cache (attention sinks plus a recent-token window) for Streaming Heads. This selective caching substantially reduces memory footprint and latency. Retrieval Heads are identified with an optimization-based algorithm trained on synthetic data.
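
A minimal PyTorch sketch of this dual caching policy (the head classification, sink/window sizes, and per-head cache layout below are illustrative assumptions, not the repository's implementation):

```python
import torch

def update_dual_kv_cache(k_cache, v_cache, k_new, v_new, is_retrieval_head,
                         sink_tokens=64, recent_window=256):
    """Append new tokens to each head's KV cache, then trim Streaming Heads.

    k_cache/v_cache: lists of per-head tensors, shape [cached_len, head_dim]
    k_new/v_new:     lists of per-head tensors, shape [new_len, head_dim]
    is_retrieval_head: one boolean per head.
    sink_tokens / recent_window are assumed hyperparameters for illustration.
    """
    for h in range(len(k_cache)):
        # Every head first appends the newly computed keys/values.
        k_cache[h] = torch.cat([k_cache[h], k_new[h]], dim=0)
        v_cache[h] = torch.cat([v_cache[h], v_new[h]], dim=0)

        if is_retrieval_head[h]:
            continue  # Retrieval Heads keep the full, growing cache.

        # Streaming Heads keep only the initial attention sinks plus a recent
        # window, so their cache size stays constant as the context grows.
        if k_cache[h].shape[0] > sink_tokens + recent_window:
            k_cache[h] = torch.cat([k_cache[h][:sink_tokens],
                                    k_cache[h][-recent_window:]], dim=0)
            v_cache[h] = torch.cat([v_cache[h][:sink_tokens],
                                    v_cache[h][-recent_window:]], dim=0)
    return k_cache, v_cache
```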

Quick Start & Requirements

  • Installation: pip install -e . after setting up the environment (a hedged usage sketch follows this list).
  • Prerequisites: Python 3.10, CUDA 12.4, PyTorch 2.4, transformers==4.45.2, flash-attn==2.6.3, flashinfer.
  • Setup: Requires creating a Conda environment, installing dependencies, and downloading datasets/models. The demo additionally requires cloning QServe and installing its dependencies.
  • Links: Paper, Slides, Demo
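
A rough usage sketch, recalled from the repository's README: the entry-point names (load_attn_pattern, sparsify_attention_heads, enable_duo_attention_eval), module paths, keyword arguments, and the attention-pattern path are assumptions and should be verified against the repo before use.

```python
# Assumed API, recalled from the DuoAttention README -- verify names, module
# paths, and signatures against the repository before relying on this sketch.
import torch
from transformers import AutoModelForCausalLM

from duo_attn.utils import load_attn_pattern, sparsify_attention_heads
from duo_attn.patch import enable_duo_attention_eval

model = AutoModelForCausalLM.from_pretrained(
    "gradientai/Llama-3-8B-Instruct-Gradient-1048k",  # illustrative model choice
    torch_dtype=torch.bfloat16,
)

# Load a pre-trained retrieval-head pattern shipped with the repo (path illustrative).
attn_heads, sink_size, recent_size = load_attn_pattern(
    "attn_patterns/Llama-3-8B-Instruct-Gradient-1048k"
)

# Keep a fraction of heads as Retrieval Heads; the rest become Streaming Heads.
attn_heads, sparsity = sparsify_attention_heads(attn_heads, sparsity=0.5)

# Patch the model so Streaming Heads use the constant-length sink + recent cache.
enable_duo_attention_eval(model, attn_heads, sink_size=64, recent_size=256)
```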

Highlighted Details

  • Achieves up to 2.55x memory reduction and 2.18x decoding speedup for MHA models (a back-of-envelope sketch of the KV budget follows this list).
  • Enables Llama-3-8B to handle a 3.3 million token context on a single A100 GPU when combined with quantization.
  • Provides accuracy comparable to full attention on Needle-in-a-Haystack benchmarks.
  • Offers better KV-budget/accuracy trade-offs on LongBench benchmarks.
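
As a rough, back-of-envelope illustration of where the savings come from (not a reproduction of the paper's measurements): if a fraction r of heads are Retrieval Heads with a full-length cache and the rest keep a constant-length cache of c tokens, the KV cache shrinks to roughly r + (1 - r) * c / L of its original size at context length L. The constant cache length and context length below are assumed values, and the reported end-to-end memory figures also include weights and activations, so they sit below this idealized bound.

```python
def kv_cache_fraction(retrieval_frac, const_cache_len, context_len):
    """Idealized fraction of the full-attention KV cache that remains when
    Retrieval Heads keep all `context_len` tokens and Streaming Heads keep
    only `const_cache_len` tokens (sinks + recent window)."""
    return retrieval_frac + (1 - retrieval_frac) * const_cache_len / context_len

# Assumed, illustrative numbers (not taken from the paper):
for frac, label in [(0.25, "MHA, 25% Retrieval Heads"),
                    (0.50, "GQA, 50% Retrieval Heads")]:
    kept = kv_cache_fraction(frac, const_cache_len=320, context_len=100_000)
    print(f"{label}: keeps ~{kept:.0%} of the KV cache (~{1 / kept:.1f}x smaller)")
```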

Maintenance & Community

The project originates from the MIT HAN Lab (mit-han-lab) at MIT. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes, and commercial use implications are not detailed.

Limitations & Caveats

The project is presented as a research artifact accompanying an ICLR 2025 paper. While it demonstrates significant efficiency gains, the reported accuracy is tied to fixed retrieval-head ratios (25% for MHA, 50% for GQA), so trade-offs may differ for other tasks or models. Setup also involves heavy dependency management, including specific CUDA, PyTorch, flash-attn, and transformers versions.

Health Check
  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
Star History
24 stars in the last 90 days
