LLM KV cache compression made easy
This library provides easy-to-use KV cache compression methods for LLMs, targeting researchers and developers aiming to reduce the significant memory footprint of long-context inference. It offers a simplified interface to apply and benchmark various compression techniques, enabling more efficient deployment of large models.
How It Works
kvpress implements compression by applying custom forward hooks to attention layers during the pre-filling phase. These hooks modify the KV cache based on different scoring mechanisms (e.g., random, norm-based, attention-weighted) or structural approaches (e.g., chunking, layer-specific ratios). This allows for significant memory reduction, with the goal of maintaining inference speed and accuracy.
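As a rough, self-contained sketch of the idea (this is not kvpress's internal code), a norm-based press amounts to scoring each cached key/value pair and dropping the lowest-scoring fraction of the cache during pre-filling:

```python
import torch

def prune_kv_by_score(keys, values, compression_ratio=0.5):
    """Toy sketch of a norm-based press (not kvpress internals).
    keys/values: (batch, num_heads, seq_len, head_dim)."""
    # Heuristic behind norm-based presses: keys with a small L2 norm
    # tend to receive more attention, so they get a higher score.
    scores = -keys.norm(dim=-1)                            # (batch, heads, seq_len)
    n_keep = int(keys.shape[2] * (1 - compression_ratio))
    idx = scores.topk(n_keep, dim=-1).indices              # KV pairs to keep
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)

# Compress a dummy 1024-token cache down to 512 entries per head
k, v = torch.randn(1, 8, 1024, 64), torch.randn(1, 8, 1024, 64)
k_small, v_small = prune_kv_by_score(k, v, compression_ratio=0.5)
print(k_small.shape)  # torch.Size([1, 8, 512, 64])
```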
Quick Start & Requirements
pip install kvpress
Flash attention is recommended for optimized attention:
pip install flash-attn --no-build-isolation
Compression is applied through a custom transformers pipeline: pipeline("kv-press-text-generation", ...).
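A minimal usage sketch of that pipeline, assuming a press class such as ExpectedAttentionPress with a compression_ratio argument (the model checkpoint is illustrative; check the kvpress documentation for the current API):

```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # any press class can be used here

# Importing kvpress registers the custom "kv-press-text-generation" pipeline
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative checkpoint
    device="cuda",
    torch_dtype="auto",
)

context = "A very long document you want to compress once and query many times..."
question = "What is the document about?"

# Drop ~50% of the KV cache during pre-filling
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```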
Highlighted Details
Integrates with transformers pipelines and supports quantization via QuantizedCache.
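A hedged sketch of how the two features could be combined, assuming presses can be applied as a context manager around generation and using transformers' quantized-cache generation options (which require optimum-quanto); treat the press class and checkpoint as illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvpress import KnormPress  # illustrative press choice

ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
press = KnormPress(compression_ratio=0.5)

# Prune the cache during pre-filling *and* store what remains quantized.
# The "quanto" backend needs optimum-quanto installed.
with press(model), torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        cache_implementation="quantized",
        cache_config={"backend": "quanto", "nbits": 4},
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```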
Maintenance & Community
The new_press.ipynb notebook documents how to implement and contribute a new press.
Licensing & Compatibility
Compatible with transformers models; tested with Llama, Mistral, Phi-3, and Qwen2.
Limitations & Caveats
eager attention is required for ObservedAttentionPress.
Using QuantizedCache requires the optimum-quanto package.
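For example (a sketch; ObservedAttentionPress and compression_ratio as named above, everything else standard transformers), the model has to be loaded with eager attention so the attention weights the press scores against are actually materialized:

```python
from transformers import AutoModelForCausalLM
from kvpress import ObservedAttentionPress

# ObservedAttentionPress scores KV pairs from the attention weights themselves,
# which flash attention never materializes, hence attn_implementation="eager".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative checkpoint
    attn_implementation="eager",
    torch_dtype="auto",
    device_map="auto",
)
press = ObservedAttentionPress(compression_ratio=0.5)
# Use `press` with the pipeline or context manager as shown earlier.
```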