H2O by FMInference

KV cache eviction research paper for efficient LLM inference

Created 2 years ago
499 stars

Top 62.3% on SourcePulse

Project Summary

H2O (Heavy-Hitter Oracle) addresses the significant memory overhead of the KV cache in Large Language Models (LLMs) during generative inference, particularly for long-content generation. Targeting researchers and engineers working with LLMs, it offers a novel KV cache eviction policy that drastically reduces the memory footprint and improves inference throughput.

How It Works

H2O leverages the observation that a small subset of tokens, termed "Heavy Hitters" (H2), disproportionately contribute to attention scores. These H2 tokens are identified through their frequent co-occurrence in text. The H2O eviction policy dynamically maintains a balance between recent and H2 tokens, formulated as a dynamic submodular optimization problem with theoretical guarantees. This approach aims to retain crucial contextual information while discarding less impactful tokens, thereby reducing memory usage without significant performance degradation.
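
To make the policy concrete, here is a minimal, hypothetical sketch of the keep/evict decision: each cached token accumulates the attention it has received, and at every step the cache retains a fixed budget of the most recent tokens plus the highest-scoring heavy hitters. The function name `h2o_keep_mask`, the budget parameters, and the toy scores below are illustrative assumptions, not code from the repository.

```python
import torch

def h2o_keep_mask(attn_scores: torch.Tensor,
                  num_heavy: int,
                  num_recent: int) -> torch.Tensor:
    """Boolean mask over cached tokens marking which KV entries to keep.

    attn_scores: [seq_len] attention mass each past token has accumulated
                 (summed over query positions and heads).
    num_heavy:   budget for heavy-hitter tokens (largest accumulated scores).
    num_recent:  budget for the most recent tokens, which are always kept.
    """
    seq_len = attn_scores.shape[-1]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Always keep the most recent tokens (the "local" part of the budget).
    keep[-num_recent:] = True

    # Among the older tokens, keep those with the largest accumulated
    # attention scores -- the heavy hitters.
    older = attn_scores.clone()
    older[-num_recent:] = float("-inf")  # exclude the already-kept recent tokens
    heavy = torch.topk(older, k=min(num_heavy, seq_len - num_recent)).indices
    keep[heavy] = True
    return keep

# Toy example: 12 cached tokens, budget of 3 heavy hitters + 4 recent tokens.
scores = torch.tensor([5.0, 0.1, 3.2, 0.0, 0.3, 7.1, 0.2, 0.1, 1.0, 0.4, 0.2, 0.3])
print(h2o_keep_mask(scores, num_heavy=3, num_recent=4))
```

Everything outside the keep mask would be evicted from the KV cache before the next decoding step, bounding memory by the two budgets rather than by the full sequence length.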

Quick Start & Requirements

  • Install: Code provided for integration with FlexGen (h2o_flexgen) and Hugging Face (h2o_hf). Specific installation commands are not detailed in the README.
  • Prerequisites: Requires Python and, presumably, the dependencies of FlexGen and Hugging Face Transformers; GPU acceleration is implied for LLM inference.
  • Resources: No specific setup time or resource footprint is mentioned.
  • Links: NeurIPS'23 Paper

Highlighted Details

  • Improves throughput by up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate on OPT-6.7B and OPT-30B.
  • Reduces latency by up to 1.9x for the same batch size.
  • Validated on OPT, LLaMA, and GPT-NeoX models.
  • Provides both simulation (masking attention matrix) and real KV dropping implementations.
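
The "simulation" mode mentioned above can be pictured as masking evicted positions in the attention matrix instead of physically dropping cache entries, which makes it easy to measure the accuracy impact of an eviction policy. The sketch below is a hypothetical illustration of that idea; `simulate_eviction` and its inputs are assumptions, not an API from the repository.

```python
import torch

def simulate_eviction(attn_weights: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Mimic KV eviction by zeroing evicted columns of the attention matrix.

    attn_weights: [num_queries, seq_len] softmax-normalized attention weights.
    keep:         [seq_len] boolean mask of tokens the eviction policy retains.
    """
    masked = attn_weights * keep.to(attn_weights.dtype)
    # Renormalize each row so the model effectively attends only to kept tokens.
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-9)

# Toy usage: 2 query positions over 12 cached tokens, 7 of which are retained.
attn = torch.softmax(torch.randn(2, 12), dim=-1)
keep = torch.zeros(12, dtype=torch.bool)
keep[[0, 2, 5, 8, 9, 10, 11]] = True
print(simulate_eviction(attn, keep).sum(dim=-1))  # each row sums to 1 over kept tokens
```

The real KV-dropping path instead removes the evicted entries from the key/value tensors, which is what yields the memory and throughput gains reported above.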

Maintenance & Community

  • Developed by Zhenyu Zhang, Ying Sheng, Tianyi Zhou, et al.
  • Associated with the NeurIPS'23 paper. No specific community channels or roadmap are linked in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The implementation is presented as code for a research paper, and its production-readiness or long-term maintenance status is not specified.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

Top 0.2% on SourcePulse
1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 10 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

streaming-llm by mit-han-lab

Top 0.1% on SourcePulse
7k stars
Framework for efficient LLM streaming
Created 2 years ago
Updated 1 year ago