kvpress by NVIDIA

LLM KV cache compression made easy

Created 1 year ago
748 stars

Top 46.5% on SourcePulse

Project Summary

This library provides easy-to-use KV cache compression methods for LLMs, targeting researchers and developers aiming to reduce the significant memory footprint of long-context inference. It offers a simplified interface to apply and benchmark various compression techniques, enabling more efficient deployment of large models.

How It Works

kvpress implements compression by applying custom forward hooks to attention layers during the pre-filling phase. These hooks modify the KV cache based on different scoring mechanisms (e.g., random, norm-based, attention-weighted) or structural approaches (e.g., chunking, layer-specific ratios). This allows for significant memory reduction, with the goal of maintaining inference speed and accuracy.
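The presses differ mainly in how they score tokens before pruning. The snippet below is a minimal, standalone illustration of that score-then-prune step (not kvpress source code), using key L2 norm as an arbitrary stand-in for a press's scoring function; the real presses apply this logic inside forward hooks on the model's attention layers.

    import torch

    def prune_kv_by_score(keys, values, compression_ratio=0.5):
        """Illustrative only: keep the tokens whose keys score highest.
        keys/values have shape (batch, num_heads, seq_len, head_dim)."""
        batch, heads, seq_len, head_dim = keys.shape
        n_keep = max(1, int(seq_len * (1 - compression_ratio)))
        scores = keys.norm(dim=-1)                 # stand-in score, (batch, heads, seq_len)
        idx = scores.topk(n_keep, dim=-1).indices  # token positions to keep, per head
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
        return keys.gather(2, idx), values.gather(2, idx)

    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    k_small, v_small = prune_kv_by_score(k, v, compression_ratio=0.75)
    print(k_small.shape)  # torch.Size([1, 8, 256, 64])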

Quick Start & Requirements

  • Install: pip install kvpress
  • Recommended: pip install flash-attn --no-build-isolation for optimized attention.
  • Requires CUDA-enabled GPU.
  • Usage example: pipeline("kv-press-text-generation", ...) (see the sketch after this list).
  • Demo notebooks available for detailed examples and evaluation.
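A minimal usage sketch built around the pipeline name above. ExpectedAttentionPress comes from the method list under Highlighted Details; the compression_ratio and press arguments, the model choice, and the result format are assumptions to verify against the project's README.

    from transformers import pipeline
    from kvpress import ExpectedAttentionPress

    # Importing kvpress registers the "kv-press-text-generation" task (assumed behavior).
    pipe = pipeline(
        "kv-press-text-generation",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # hypothetical model choice
        device="cuda:0",
    )

    context = "..."   # the long document whose KV cache will be compressed
    question = "..."  # answered against the compressed cache

    # compression_ratio and press= are assumed keyword names; check the README.
    press = ExpectedAttentionPress(compression_ratio=0.5)
    result = pipe(context, question=question, press=press)
    print(result)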

Highlighted Details

  • Offers a wide array of 15+ compression methods, including RandomPress, KnormPress, SnapKVPress, ExpectedAttentionPress, StreamingLLM, TOVA, QFilterPress, and more.
  • Supports composition of methods via wrapper presses like AdaKVPress and ComposedPress (see the sketch after this list).
  • Integrates with Hugging Face transformers pipelines and supports quantization via QuantizedCache.
  • Benchmarking tools and notebooks are provided for measuring memory and speed gains.
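A hedged sketch of how that composition might look. The class names appear in the list above, but the keyword arguments (press, presses, compression_ratio) are assumptions; the repository's notebooks are the authoritative reference.

    from kvpress import AdaKVPress, ComposedPress, KnormPress, SnapKVPress

    # Wrap a scoring press with head-wise adaptive budgets (assumed keyword: press).
    adaptive = AdaKVPress(press=SnapKVPress(compression_ratio=0.5))

    # Chain two presses so their compressions apply in sequence (assumed keyword: presses).
    combined = ComposedPress(
        presses=[KnormPress(compression_ratio=0.25), SnapKVPress(compression_ratio=0.25)]
    )

    # Either object would then be passed to the pipeline, e.g.
    # pipe(context, question=question, press=combined)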

Maintenance & Community

  • Open to contributions via issues and pull requests.
  • A guide for adding new presses is available in new_press.ipynb.
  • Links to relevant "Awesome" lists for KV cache compression are provided.

Licensing & Compatibility

  • No explicit license mentioned in the README.
  • Compatible with Hugging Face transformers models, tested with Llama, Mistral, Phi-3, and Qwen2.

Limitations & Caveats

  • Some presses are model-architecture dependent and may not work with all LLMs.
  • Flash Attention 2 is recommended but eager attention is required for ObservedAttentionPress.
  • QuantizedCache requires additional dependencies like optimum-quanto.
Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 7 more.

  • LLMLingua by microsoft: Prompt compression for accelerated LLM inference. Top 0.4% on SourcePulse, 6k stars. Created 2 years ago, updated 2 months ago.