kvpress by NVIDIA

LLM KV cache compression made easy

Created 8 months ago · 560 stars · Top 58.2% on sourcepulse

Project Summary

This library provides easy-to-use KV cache compression methods for LLMs, targeting researchers and developers aiming to reduce the significant memory footprint of long-context inference. It offers a simplified interface to apply and benchmark various compression techniques, enabling more efficient deployment of large models.

How It Works

kvpress implements compression by registering custom forward hooks on the model's attention layers during the pre-filling phase. Each hook scores the cached key-value pairs, using either a scoring mechanism (e.g., random, norm-based, attention-weighted) or a structural approach (e.g., chunking, layer-specific compression ratios), and prunes the cache accordingly. This yields significant memory reduction while aiming to preserve inference speed and accuracy.
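The pruning step such a hook performs can be illustrated with a short, library-agnostic sketch (this is not kvpress's internal API): score each cached position, here simply by the L2 norm of its key vector, and keep only the top-scoring fraction. The function name, tensor layout, and scoring direction are assumptions for illustration.

```python
# Library-agnostic sketch of KV cache pruning (not kvpress's actual API).
# Assumes the common (batch, num_heads, seq_len, head_dim) cache layout.
import torch

def norm_prune(keys: torch.Tensor, values: torch.Tensor, compression_ratio: float = 0.5):
    """Keep the (1 - compression_ratio) fraction of positions with the highest scores."""
    batch, num_heads, seq_len, head_dim = keys.shape
    n_keep = max(1, int(seq_len * (1.0 - compression_ratio)))

    # Score each position per head (here: L2 norm of its key vector).
    scores = keys.norm(dim=-1)                      # (batch, num_heads, seq_len)
    kept = scores.topk(n_keep, dim=-1).indices      # (batch, num_heads, n_keep)
    kept = kept.sort(dim=-1).values                 # preserve original token order

    # Gather the selected positions from keys and values.
    idx = kept.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    return keys.gather(2, idx), values.gather(2, idx)

if __name__ == "__main__":
    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    k2, v2 = norm_prune(k, v, compression_ratio=0.75)
    print(k2.shape)  # torch.Size([1, 8, 256, 64]) -> cache is 75% smaller
```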

Quick Start & Requirements

  • Install: pip install kvpress
  • Recommended: pip install flash-attn --no-build-isolation for optimized attention.
  • Requires CUDA-enabled GPU.
  • Usage goes through the kv-press-text-generation pipeline (see the sketch after this list).
  • Demo notebooks available for detailed examples and evaluation.
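A minimal sketch of the pipeline usage referenced above. Only the kv-press-text-generation task name and the press class names come from this summary; the model name, keyword arguments, and compression_ratio value are illustrative assumptions.

```python
# Sketch of kvpress usage via the Hugging Face transformers pipeline integration.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",   # illustrative model choice
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

context = "A very long document to pre-fill once, with its KV cache compressed."
question = "What is this document about?"

# The press decides which KV entries to keep; 0.5 drops roughly half of them.
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```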

Highlighted Details

  • Offers more than 15 compression methods ("presses"), including RandomPress, KnormPress, SnapKVPress, ExpectedAttentionPress, StreamingLLMPress, TOVAPress, QFilterPress, and more.
  • Supports composing methods via wrapper presses such as AdaKVPress and ComposedPress (see the sketch after this list).
  • Integrates with Hugging Face transformers pipelines and supports quantization via QuantizedCache.
  • Benchmarking tools and notebooks are provided for measuring memory and speed gains.
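A hedged sketch of press composition; the constructor arguments and exact wrapping semantics are assumptions based on the class names listed above.

```python
# Composing presses (sketch; constructor arguments are assumptions).
from kvpress import AdaKVPress, ComposedPress, ExpectedAttentionPress, KnormPress

# Run two presses one after the other.
press = ComposedPress([KnormPress(compression_ratio=0.3),
                       ExpectedAttentionPress(compression_ratio=0.4)])

# Or wrap a scoring press so the compression budget is allocated adaptively (AdaKV-style).
press = AdaKVPress(ExpectedAttentionPress(compression_ratio=0.5))
```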

Maintenance & Community

  • Open to contributions via issues and pull requests.
  • A guide for adding new presses is available in new_press.ipynb.
  • Links to relevant "Awesome" lists for KV cache compression are provided.

Licensing & Compatibility

  • No explicit license mentioned in the README.
  • Compatible with Hugging Face transformers models, tested with Llama, Mistral, Phi-3, and Qwen2.

Limitations & Caveats

  • Some presses are model-architecture dependent and may not work with all LLMs.
  • Flash Attention 2 is recommended, but ObservedAttentionPress requires eager attention (see the snippet after this list).
  • QuantizedCache requires additional dependencies like optimum-quanto.
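Because ObservedAttentionPress needs materialized attention weights, the model has to be loaded with eager attention rather than Flash Attention 2. A sketch using the standard transformers attn_implementation option; the model name is illustrative.

```python
# Load with eager attention so attention weights are available to the press (sketch).
from transformers import pipeline
from kvpress import ObservedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",   # illustrative model choice
    device="cuda",
    model_kwargs={"attn_implementation": "eager"},
)
press = ObservedAttentionPress(compression_ratio=0.5)
```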
Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 14
  • Issues (30d): 10
  • Star History: 98 stars in the last 90 days
