duo-attention by mit-han-lab

Framework for efficient long-context LLM inference

Created 1 year ago
499 stars

Top 62.1% on SourcePulse

Project Summary

DuoAttention addresses the significant memory and latency challenges of long-context LLM inference. It targets researchers and engineers working with large language models who need to process extended contexts efficiently without sacrificing accuracy. The framework offers substantial reductions in memory use and latency during both pre-filling and decoding.

How It Works

DuoAttention observes that only a subset of attention heads, termed "Retrieval Heads," is critical for long-context processing, while the remaining "Streaming Heads" mainly attend to recent tokens and attention sinks. It keeps a full KV cache only for Retrieval Heads and a lightweight, constant-length cache for Streaming Heads, which sharply reduces memory footprint and latency. Retrieval Heads are identified via an optimization-based algorithm trained on synthetic data.
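
The sketch below is a minimal, self-contained PyTorch illustration of this split cache, not the library's actual implementation or API: retrieval heads accumulate the full KV history, while streaming heads keep only a few attention-sink tokens plus a recent window. The class name, the sink/recent sizes, and the head mask are illustrative assumptions; in DuoAttention the head classification comes from the trained gate values.

```python
# Conceptual sketch of DuoAttention's split KV cache (illustrative only).
# Retrieval heads keep the full history; streaming heads keep only a few
# "attention sink" tokens plus a recent window, so their cache stays constant-length.
import torch


class DuoKVCache:
    def __init__(self, is_retrieval_head: torch.Tensor, sink: int = 4, recent: int = 256):
        # is_retrieval_head: bool tensor [num_heads]; in DuoAttention this
        # classification comes from an optimization on synthetic data.
        self.retrieval_idx = torch.nonzero(is_retrieval_head, as_tuple=True)[0]
        self.streaming_idx = torch.nonzero(~is_retrieval_head, as_tuple=True)[0]
        self.sink, self.recent = sink, recent
        self.full_k = self.full_v = None      # growing caches for retrieval heads
        self.stream_k = self.stream_v = None  # constant-length caches for streaming heads

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        """k_new, v_new: [num_heads, new_tokens, head_dim]."""
        rk, rv = k_new[self.retrieval_idx], v_new[self.retrieval_idx]
        sk, sv = k_new[self.streaming_idx], v_new[self.streaming_idx]

        # Retrieval heads: append everything (cache grows with context length).
        self.full_k = rk if self.full_k is None else torch.cat([self.full_k, rk], dim=1)
        self.full_v = rv if self.full_v is None else torch.cat([self.full_v, rv], dim=1)

        # Streaming heads: append, then evict the middle so only sink + recent remain.
        self.stream_k = sk if self.stream_k is None else torch.cat([self.stream_k, sk], dim=1)
        self.stream_v = sv if self.stream_v is None else torch.cat([self.stream_v, sv], dim=1)
        n = self.stream_k.shape[1]
        if n > self.sink + self.recent:
            keep = torch.cat([torch.arange(self.sink), torch.arange(n - self.recent, n)])
            self.stream_k = self.stream_k[:, keep]
            self.stream_v = self.stream_v[:, keep]
        return (self.full_k, self.full_v), (self.stream_k, self.stream_v)


# Example: 32 heads, 25% flagged as retrieval heads (the MHA ratio reported below).
heads = 32
is_retrieval = torch.zeros(heads, dtype=torch.bool)
is_retrieval[: heads // 4] = True
cache = DuoKVCache(is_retrieval)
for _ in range(10):                       # feed 10 chunks of 128 tokens
    k = torch.randn(heads, 128, 64)
    v = torch.randn(heads, 128, 64)
    (full_k, _), (stream_k, _) = cache.update(k, v)
print(full_k.shape, stream_k.shape)       # full cache grows; streaming cache is capped
```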

Quick Start & Requirements

  • Installation: pip install -e . after setting up the environment (a hedged usage sketch follows this list).
  • Prerequisites: Python 3.10, CUDA 12.4, PyTorch 2.4, transformers==4.45.2, flash-attn==2.6.3, flashinfer.
  • Setup: Create a Conda environment, install the dependencies, and download the required datasets/models. The demo additionally requires cloning qserve and installing its dependencies.
  • Links: Paper, Slides, Demo
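
As a rough orientation for what running the framework might look like after installation, the sketch below loads a base model with transformers and marks where DuoAttention's patching would be applied. The model name is only an example, and the enable_duo_attention call is a hypothetical placeholder (commented out), since this summary does not document the package's entry points; the repo's eval and demo scripts are the authoritative reference.

```python
# Hedged usage sketch after installation; the DuoAttention patching call below
# is a hypothetical placeholder, not the package's documented API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder for applying DuoAttention's head classification and split KV cache;
# the real project ships per-model retrieval-head patterns and patching utilities.
# enable_duo_attention(model, retrieval_head_pattern=...)  # hypothetical name

inputs = tokenizer("Long-context prompt goes here...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```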

Highlighted Details

  • Achieves up to 2.55x memory reduction and 2.18x latency reduction for MHA models (a back-of-envelope KV-cache estimate follows this list).
  • Enables Llama-3-8B with a 3.3 million context length on a single A100 GPU when combined with quantization.
  • Provides comparable accuracy to full attention on Needle-in-a-Haystack benchmarks.
  • Offers better KV budget and accuracy trade-offs on LongBench benchmarks.
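
For intuition on where such numbers come from, here is a back-of-envelope estimate of the KV-cache fraction retained when only a fraction of heads keep a full cache. It counts KV entries only, so it overstates the end-to-end memory reduction reported above; the context length and streaming-cache budget used here are illustrative assumptions.

```python
# Back-of-envelope estimate of the KV-cache saving from DuoAttention's split
# (ignores weights and activations, so it overstates the end-to-end reduction
# reported above). All concrete numbers below are illustrative assumptions.

def kv_cache_ratio(retrieval_frac: float, context_len: int, streaming_budget: int) -> float:
    """Approximate DuoAttention KV size as a fraction of the full KV cache."""
    return retrieval_frac + (1 - retrieval_frac) * streaming_budget / context_len

# 25% retrieval heads (the MHA ratio), 128k context, ~320-token streaming cache.
ratio = kv_cache_ratio(0.25, 128_000, 320)
print(f"KV cache at {ratio:.1%} of full size -> ~{1 / ratio:.1f}x KV-cache reduction")
```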

Maintenance & Community

The project originates from the mit-han-lab at MIT. No specific community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes, and commercial use implications are not detailed.

Limitations & Caveats

The project is presented as a research artifact accompanying an ICLR 2025 paper. While it demonstrates significant efficiency gains, the reported accuracy holds for specific retrieval-head ratios (25% for MHA, 50% for GQA) on specific benchmarks, so trade-offs may differ for other tasks or models. Setup also involves heavy dependency management, including pinned CUDA and PyTorch versions.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
