H2O by FMInference

KV cache eviction research paper for efficient LLM inference

Created 2 years ago
499 stars

Top 62.3% on SourcePulse

Project Summary

H2O (Heavy-Hitter Oracle) addresses the significant memory overhead of the KV cache in Large Language Models (LLMs) during generative inference, particularly for long-content generation. Targeting researchers and engineers working with LLMs, it offers a novel KV cache eviction policy that drastically reduces the memory footprint and improves inference throughput.

How It Works

H2O leverages the observation that a small subset of tokens, termed "Heavy Hitters" (H2), disproportionately contribute to attention scores. These H2 tokens are identified through their frequent co-occurrence in text. The H2O eviction policy dynamically maintains a balance between recent and H2 tokens, formulated as a dynamic submodular optimization problem with theoretical guarantees. This approach aims to retain crucial contextual information while discarding less impactful tokens, thereby reducing memory usage without significant performance degradation.
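
To make the policy concrete, here is a minimal, hypothetical sketch of the keep/evict decision: each cached token accumulates the attention it has received, and at every step the cache retains a fixed budget of the most recent tokens plus the highest-scoring heavy hitters. The function name `h2o_keep_mask`, the budget parameters, and the toy scores below are illustrative assumptions, not code from the repository.

```python
import torch

def h2o_keep_mask(attn_scores: torch.Tensor,
                  num_heavy: int,
                  num_recent: int) -> torch.Tensor:
    """Boolean mask over cached tokens marking which KV entries to keep.

    attn_scores: [seq_len] attention mass each past token has accumulated
                 (summed over query positions and heads).
    num_heavy:   budget for heavy-hitter tokens (largest accumulated scores).
    num_recent:  budget for the most recent tokens, which are always kept.
    """
    seq_len = attn_scores.shape[-1]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Always keep the most recent tokens (the "local" part of the budget).
    keep[-num_recent:] = True

    # Among the older tokens, keep those with the largest accumulated
    # attention scores -- the heavy hitters.
    older = attn_scores.clone()
    older[-num_recent:] = float("-inf")  # exclude the already-kept recent tokens
    heavy = torch.topk(older, k=min(num_heavy, seq_len - num_recent)).indices
    keep[heavy] = True
    return keep

# Toy example: 12 cached tokens, budget of 3 heavy hitters + 4 recent tokens.
scores = torch.tensor([5.0, 0.1, 3.2, 0.0, 0.3, 7.1, 0.2, 0.1, 1.0, 0.4, 0.2, 0.3])
print(h2o_keep_mask(scores, num_heavy=3, num_recent=4))
```

Everything outside the keep mask would be evicted from the KV cache before the next decoding step, bounding memory by the two budgets rather than by the full sequence length.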

Quick Start & Requirements

  • Install: Code provided for integration with FlexGen (h2o_flexgen) and Hugging Face (h2o_hf). Specific installation commands are not detailed in the README.
  • Prerequisites: Requires Python and, presumably, the dependencies of FlexGen and Hugging Face Transformers; GPU acceleration is implied for LLM inference.
  • Resources: No specific setup time or resource footprint is mentioned.
  • Links: NeurIPS'23 Paper

Highlighted Details

  • Improves throughput by up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate on OPT-6.7B and OPT-30B.
  • Reduces latency by up to 1.9x for the same batch size.
  • Validated on OPT, LLaMA, and GPT-NeoX models.
  • Provides both simulation (masking attention matrix) and real KV dropping implementations.
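
The "simulation" mode mentioned above can be pictured as masking evicted positions in the attention matrix instead of physically dropping cache entries, which makes it easy to measure the accuracy impact of an eviction policy. The sketch below is a hypothetical illustration of that idea; `simulate_eviction` and its inputs are assumptions, not an API from the repository.

```python
import torch

def simulate_eviction(attn_weights: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Mimic KV eviction by zeroing evicted columns of the attention matrix.

    attn_weights: [num_queries, seq_len] softmax-normalized attention weights.
    keep:         [seq_len] boolean mask of tokens the eviction policy retains.
    """
    masked = attn_weights * keep.to(attn_weights.dtype)
    # Renormalize each row so the model effectively attends only to kept tokens.
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-9)

# Toy usage: 2 query positions over 12 cached tokens, 7 of which are retained.
attn = torch.softmax(torch.randn(2, 12), dim=-1)
keep = torch.zeros(12, dtype=torch.bool)
keep[[0, 2, 5, 8, 9, 10, 11]] = True
print(simulate_eviction(attn, keep).sum(dim=-1))  # each row sums to 1 over kept tokens
```

The real KV-dropping path instead removes the evicted entries from the key/value tensors, which is what yields the memory and throughput gains reported above.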

Maintenance & Community

  • Developed by Zhenyu Zhang, Ying Sheng, Tianyi Zhou, et al.
  • Associated with the NeurIPS'23 paper. No specific community channels or roadmap are linked in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The implementation is presented as code for a research paper, and its production-readiness or long-term maintenance status is not specified.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

Top 0.2% on SourcePulse
1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 10 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

streaming-llm by mit-han-lab

Top 0.1% on SourcePulse
7k stars
Framework for efficient LLM streaming
Created 2 years ago
Updated 1 year ago