kvpress by NVIDIA

LLM KV cache compression made easy

Created 1 year ago
748 stars

Top 46.5% on SourcePulse

Project Summary

This library provides easy-to-use KV cache compression methods for LLMs, targeting researchers and developers aiming to reduce the significant memory footprint of long-context inference. It offers a simplified interface to apply and benchmark various compression techniques, enabling more efficient deployment of large models.

How It Works

kvpress implements compression by applying custom forward hooks to attention layers during the pre-filling phase. These hooks modify the KV cache based on different scoring mechanisms (e.g., random, norm-based, attention-weighted) or structural approaches (e.g., chunking, layer-specific ratios). This allows for significant memory reduction, with the goal of maintaining inference speed and accuracy.
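The presses differ mainly in how they score tokens before pruning. The snippet below is a minimal, standalone illustration of that score-then-prune step (not kvpress source code), using key L2 norm as an arbitrary stand-in for a press's scoring function; the real presses apply this logic inside forward hooks on the model's attention layers.

    import torch

    def prune_kv_by_score(keys, values, compression_ratio=0.5):
        """Illustrative only: keep the tokens whose keys score highest.
        keys/values have shape (batch, num_heads, seq_len, head_dim)."""
        batch, heads, seq_len, head_dim = keys.shape
        n_keep = max(1, int(seq_len * (1 - compression_ratio)))
        scores = keys.norm(dim=-1)                 # stand-in score, (batch, heads, seq_len)
        idx = scores.topk(n_keep, dim=-1).indices  # token positions to keep, per head
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
        return keys.gather(2, idx), values.gather(2, idx)

    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    k_small, v_small = prune_kv_by_score(k, v, compression_ratio=0.75)
    print(k_small.shape)  # torch.Size([1, 8, 256, 64])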

Quick Start & Requirements

  • Install: pip install kvpress
  • Recommended: pip install flash-attn --no-build-isolation for optimized attention.
  • Requires CUDA-enabled GPU.
  • Usage example: pipeline("kv-press-text-generation", ...) (see the sketch after this list).
  • Demo notebooks available for detailed examples and evaluation.
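A minimal usage sketch built around the pipeline name above. ExpectedAttentionPress comes from the method list under Highlighted Details; the compression_ratio and press arguments, the model choice, and the result format are assumptions to verify against the project's README.

    from transformers import pipeline
    from kvpress import ExpectedAttentionPress

    # Importing kvpress registers the "kv-press-text-generation" task (assumed behavior).
    pipe = pipeline(
        "kv-press-text-generation",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # hypothetical model choice
        device="cuda:0",
    )

    context = "..."   # the long document whose KV cache will be compressed
    question = "..."  # answered against the compressed cache

    # compression_ratio and press= are assumed keyword names; check the README.
    press = ExpectedAttentionPress(compression_ratio=0.5)
    result = pipe(context, question=question, press=press)
    print(result)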

Highlighted Details

  • Offers a wide array of 15+ compression methods, including RandomPress, KnormPress, SnapKVPress, ExpectedAttentionPress, StreamingLLM, TOVA, QFilterPress, and more.
  • Supports composition of methods via wrapper presses like AdaKVPress and ComposedPress (see the sketch after this list).
  • Integrates with Hugging Face transformers pipelines and supports quantization via QuantizedCache.
  • Benchmarking tools and notebooks are provided for measuring memory and speed gains.
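A hedged sketch of how that composition might look. The class names appear in the list above, but the keyword arguments (press, presses, compression_ratio) are assumptions; the repository's notebooks are the authoritative reference.

    from kvpress import AdaKVPress, ComposedPress, KnormPress, SnapKVPress

    # Wrap a scoring press with head-wise adaptive budgets (assumed keyword: press).
    adaptive = AdaKVPress(press=SnapKVPress(compression_ratio=0.5))

    # Chain two presses so their compressions apply in sequence (assumed keyword: presses).
    combined = ComposedPress(
        presses=[KnormPress(compression_ratio=0.25), SnapKVPress(compression_ratio=0.25)]
    )

    # Either object would then be passed to the pipeline, e.g.
    # pipe(context, question=question, press=combined)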

Maintenance & Community

  • Open to contributions via issues and pull requests.
  • A guide for adding new presses is available in new_press.ipynb.
  • Links to relevant "Awesome" lists for KV cache compression are provided.

Licensing & Compatibility

  • No explicit license mentioned in the README.
  • Compatible with Hugging Face transformers models, tested with Llama, Mistral, Phi-3, and Qwen2.

Limitations & Caveats

  • Some presses are model-architecture dependent and may not work with all LLMs.
  • Flash Attention 2 is recommended but eager attention is required for ObservedAttentionPress.
  • QuantizedCache requires additional dependencies like optimum-quanto.
Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 7 more.

  • LLMLingua by microsoft: Prompt compression for accelerated LLM inference. Top 0.4% on SourcePulse, 6k stars. Created 2 years ago, updated 2 months ago.