H2O by FMInference

KV cache eviction research paper for efficient LLM inference

created 2 years ago
462 stars

Top 66.5% on sourcepulse

Project Summary

H2O (Heavy-Hitter Oracle) targets the memory overhead of the KV cache during generative inference with Large Language Models (LLMs). That overhead grows with sequence length and batch size, so it is most costly in long-content generation. Aimed at researchers and engineers working with LLMs, H2O provides a KV cache eviction policy that reduces memory footprint and improves inference throughput.

How It Works

H2O builds on the observation that a small subset of tokens, termed "Heavy Hitters" (H2), contributes most of the value when attention scores are computed. The paper reports that the emergence of H2 tokens correlates strongly with frequent co-occurrence of tokens in text, and that removing them degrades performance significantly. The H2O eviction policy dynamically retains a balance of recent and H2 tokens within a fixed cache budget; eviction is formulated as a dynamic submodular problem, with a theoretical guarantee proved under mild assumptions. The aim is to keep the most informative context while discarding less impactful tokens, reducing memory usage without significant performance degradation.
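The eviction idea can be illustrated with a minimal sketch. It is not the repository's code: it assumes heavy hitters are ranked by attention accumulated over decoding steps, and the names `h2o_keep_indices`, `H2OCacheSketch`, `heavy_budget`, and `recent_budget` are hypothetical, chosen only for this example.

```python
def h2o_keep_indices(acc_scores, heavy_budget, recent_budget):
    """Return the sorted cache indices to keep (illustrative helper, not the repo API).

    acc_scores[i] is the attention accumulated so far by cached token i.
    The newest `recent_budget` tokens are always kept; among the older ones,
    the `heavy_budget` tokens with the highest accumulated score are kept.
    """
    n = len(acc_scores)
    if n <= heavy_budget + recent_budget:
        return list(range(n))  # still within budget, nothing to evict
    split = n - recent_budget
    recent = list(range(split, n))
    older = range(split)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:heavy_budget]
    return sorted(heavy) + recent


class H2OCacheSketch:
    """Tracks which token positions survive; stands in for real key/value tensors."""

    def __init__(self, heavy_budget, recent_budget):
        self.heavy_budget = heavy_budget
        self.recent_budget = recent_budget
        self.positions = []   # original positions of the tokens still cached
        self.acc_scores = []  # accumulated attention for each cached token
        self.next_pos = 0

    def step(self, attn_weights):
        """Add one token, accumulate its attention row, then evict.

        attn_weights holds the new query's attention over the cached tokens
        followed by the new token itself.
        """
        self.positions.append(self.next_pos)
        self.acc_scores.append(0.0)
        self.next_pos += 1
        if len(attn_weights) != len(self.positions):
            raise ValueError("attention row must cover the cache plus the new token")
        for i, w in enumerate(attn_weights):
            self.acc_scores[i] += w
        keep = h2o_keep_indices(self.acc_scores, self.heavy_budget, self.recent_budget)
        self.positions = [self.positions[i] for i in keep]
        self.acc_scores = [self.acc_scores[i] for i in keep]
        return self.positions
```

With `heavy_budget=1` and `recent_budget=2`, a token that keeps receiving most of the attention stays cached while older low-scoring tokens are dropped as the window of recent tokens slides forward.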

Quick Start & Requirements

  • Install: Code provided for integration with FlexGen (h2o_flexgen) and Hugging Face (h2o_hf). Specific installation commands are not detailed in the README.
  • Prerequisites: Requires Python, and likely dependencies associated with FlexGen and Hugging Face Transformers. GPU acceleration is implied for LLM inference.
  • Resources: No specific setup time or resource footprint is mentioned.
  • Links: NeurIPS'23 Paper

Highlighted Details

  • With 20% heavy hitters, improves throughput by up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate, and by up to 3x over FlexGen, on OPT-6.7B and OPT-30B.
  • Reduces latency by up to 1.9x for the same batch size.
  • Validated on OPT, LLaMA, and GPT-NeoX models.
  • Provides both simulation (masking attention matrix) and real KV dropping implementations.
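The difference between the simulation and real dropping can be sketched as follows. In a simulation, the key/value entries stay in memory and an attention mask hides the tokens the policy would have evicted, so accuracy can be studied without changing cache shapes. The function below is an assumption-laden illustration, not the repository's implementation: `h2o_attention_mask` and its parameters are hypothetical names, and heavy hitters are again ranked by accumulated attention.

```python
def h2o_attention_mask(attn, heavy_budget, recent_budget):
    """Build a boolean mask that emulates eviction without dropping anything.

    attn is a square list of lists where attn[t][j] is the causal attention
    weight of query t on key j (zero for j > t). mask[t][j] is True when
    query t is still allowed to attend to key j.
    """
    n = len(attn)
    acc = [0.0] * n  # accumulated attention per key
    kept = []        # keys that have not been (virtually) evicted
    mask = [[False] * n for _ in range(n)]
    for t in range(n):
        kept.append(t)
        total = sum(attn[t][j] for j in kept)
        for j in kept:
            mask[t][j] = True
            # Renormalizing over visible keys matches masking before softmax.
            acc[j] += attn[t][j] / total if total > 0 else 0.0
        if len(kept) > heavy_budget + recent_budget:
            split = len(kept) - recent_budget
            recent = kept[split:]
            older = kept[:split]
            heavy = sorted(older, key=lambda j: acc[j], reverse=True)[:heavy_budget]
            kept = sorted(heavy) + recent
    return mask
```

A real-dropping implementation would instead delete the evicted rows from the key and value tensors, which is what actually frees memory; the mask only reproduces the effect on the attention output.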

Maintenance & Community

  • Developed by Zhenyu Zhang, Ying Sheng, Tianyi Zhou, et al.
  • Associated with the NeurIPS'23 paper. No specific community channels or roadmap are linked in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The implementation is presented as code for a research paper, and its production-readiness or long-term maintenance status is not specified.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 20 stars in the last 90 days
