NVIDIA kvpress: LLM KV cache compression made easy
Top 50.2% on SourcePulse
This library provides easy-to-use KV cache compression methods for LLMs, targeting researchers and developers aiming to reduce the significant memory footprint of long-context inference. It offers a simplified interface to apply and benchmark various compression techniques, enabling more efficient deployment of large models.
How It Works
kvpress implements compression by applying custom forward hooks to attention layers during the pre-filling phase. These hooks modify the KV cache based on different scoring mechanisms (e.g., random, norm-based, attention-weighted) or structural approaches (e.g., chunking, layer-specific ratios). This allows for significant memory reduction, with the goal of maintaining inference speed and accuracy.
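To make the scoring idea concrete, here is a minimal, library-agnostic sketch (not kvpress's actual hook or press classes) that prunes a key/value cache by key norm. The function name, tensor shapes, and the choice to keep high-norm tokens are illustrative assumptions; the real presses compute their scores inside attention forward hooks and may use a different rule.

```python
import torch

def prune_kv_by_key_norm(keys, values, compression_ratio=0.5):
    """Keep the cached tokens whose key vectors have the largest L2 norm.

    keys, values: (batch, num_heads, seq_len, head_dim) tensors, the layout
    used by Hugging Face attention layers. Illustrative only: a real press
    may score by attention weights, random selection, or the opposite end
    of the norm spectrum.
    """
    seq_len, head_dim = keys.shape[-2], keys.shape[-1]
    n_kept = max(1, int(seq_len * (1 - compression_ratio)))

    scores = keys.norm(dim=-1)                  # one score per token and head
    kept = scores.topk(n_kept, dim=-1).indices  # highest-scoring tokens
    kept = kept.sort(dim=-1).values             # restore original token order

    idx = kept.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    return keys.gather(2, idx), values.gather(2, idx)

# Toy usage: shrink a fake 16-token cache to 8 tokens.
k, v = torch.randn(1, 4, 16, 64), torch.randn(1, 4, 16, 64)
k_small, v_small = prune_kv_by_key_norm(k, v, compression_ratio=0.5)
print(k_small.shape)  # torch.Size([1, 4, 8, 64])
```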
Quick Start & Requirements
Install with pip install kvpress. For optimized attention, install flash-attn with pip install flash-attn --no-build-isolation. Compression is applied through the pipeline("kv-press-text-generation", ...) entry point.
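A minimal usage sketch, assuming the pipeline registered by kvpress plus one of its press classes. KnormPress, the compression_ratio argument, and the model ID are taken as examples; check the repository for the exact press names and signatures.

```python
from transformers import pipeline
from kvpress import KnormPress  # one of several presses; names may vary by version

# kvpress registers a custom text-generation pipeline.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

context = "A long document whose KV cache we want to compress ..."
question = "What is this document about?"

# The press is passed at call time and compresses the cache during pre-filling.
press = KnormPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```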
Highlighted Details
Integrates with Hugging Face transformers pipelines and supports quantization via QuantizedCache.
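How the quantized cache is wired into kvpress is best checked against the repository; as a rough transformers-level sketch of the underlying feature (cache_implementation and cache_config are standard generate options, and the quanto backend needs optimum-quanto installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("A long context ...", return_tensors="pt").to(model.device)

# Quantize the KV cache to 4 bits with the quanto backend
# (requires the optimum-quanto package, see the caveat below).
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```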
Maintenance & Community
Adding a new press is documented in the new_press.ipynb notebook.
Licensing & Compatibility
Works with Hugging Face transformers models; tested with Llama, Mistral, Phi-3, and Qwen2.
Limitations & Caveats
Eager attention is required for ObservedAttentionPress. The quantized cache path depends on optimum-quanto.
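A hedged sketch of the eager-attention caveat: ObservedAttentionPress needs the attention weights, which are only materialized with the eager implementation. The pipeline call mirrors the quick start above; compression_ratio and the model ID are assumptions.

```python
from transformers import pipeline
from kvpress import ObservedAttentionPress

# ObservedAttentionPress scores tokens from attention weights, so the model
# must be loaded with attn_implementation="eager" rather than flash attention.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
    device="cuda",
    model_kwargs={"attn_implementation": "eager"},
)

press = ObservedAttentionPress(compression_ratio=0.5)
answer = pipe("A long context ...", question="Summarize it.", press=press)["answer"]
print(answer)
```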