SnapKV by FasterDecoding

KV cache compression research paper

created 1 year ago
268 stars

Top 96.5% on sourcepulse

Project Summary

SnapKV offers a novel KV cache compression method to enhance the efficiency of Large Language Models (LLMs) during inference. It targets researchers and engineers working with LLMs who need to reduce memory footprint and improve speed, particularly for models like Llama, Mistral, and Mixtral.

How It Works

SnapKV employs a compression technique that intelligently identifies and retains salient KV cache entries, discarding less critical information. This approach aims to maintain generation quality while significantly reducing memory usage, enabling longer context windows or larger batch sizes.
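
A minimal sketch of this kind of selection, assuming the approach described in the SnapKV paper (attention from an observation window at the end of the prompt votes for which earlier KV entries to keep, with pooling to cluster neighbours); the function name, tensor shapes, and defaults below are illustrative, not the repository's API:

    import torch
    import torch.nn.functional as F

    def compress_kv(keys, values, queries, window_size=32, max_capacity=256, pool_kernel=7):
        # keys/values/queries: [num_heads, seq_len, head_dim] for a single sequence.
        num_heads, seq_len, head_dim = keys.shape
        if seq_len <= max_capacity:
            return keys, values  # nothing to compress

        # Attention of the last `window_size` prompt queries over the earlier prefix.
        obs_q = queries[:, -window_size:, :]                            # [H, W, D]
        prefix_k = keys[:, : seq_len - window_size, :]                  # [H, P, D]
        scores = obs_q @ prefix_k.transpose(-1, -2) / head_dim ** 0.5   # [H, W, P]
        votes = scores.softmax(dim=-1).sum(dim=1)                       # [H, P]

        # Pool neighbouring positions so the kept tokens form small clusters.
        votes = F.max_pool1d(votes.unsqueeze(1), kernel_size=pool_kernel,
                             stride=1, padding=pool_kernel // 2).squeeze(1)

        # Keep the top-voted prefix positions plus the observation window itself.
        keep = max_capacity - window_size
        top_idx = votes.topk(keep, dim=-1).indices.sort(dim=-1).values  # [H, keep]
        idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
        new_keys = torch.cat([keys.gather(1, idx), keys[:, -window_size:, :]], dim=1)
        new_values = torch.cat([values.gather(1, idx), values[:, -window_size:, :]], dim=1)
        return new_keys, new_values  # [H, max_capacity, D]

In this sketch the compression happens once at prefill; KV entries for newly generated tokens are then appended to the compressed cache as usual.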

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository.
  • Requirements: transformers>=4.36, flash-attn==2.4.0.
  • Usage: monkey-patch existing models (e.g., replace_mistral()) or integrate SnapKV via marked comments within the model code; an example notebook is available (see the sketch after this list).
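
A hedged usage sketch of the monkey-patching route: replace_mistral() is named in the README, but the import path and model checkpoint below are assumptions; consult the example notebook for the exact interface.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from snapkv.monkeypatch.monkeypatch import replace_mistral  # assumed module path

    replace_mistral()  # patch Mistral attention so the KV cache is compressed at prefill

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        attn_implementation="flash_attention_2",  # the repository pins flash-attn==2.4.0
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Summarize the following report: ...", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The patch is applied before from_pretrained() so the modified attention classes are in place when the model is instantiated.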

Highlighted Details

  • Out-of-the-box KV cache compression.
  • Supports Llama, Mistral, and Mixtral families.
  • Focuses on memory reduction and inference speed.

Maintenance & Community

The project appears to be actively developed, with a clear citation provided for its methodology. Further community engagement channels (e.g., Discord, Slack) are not specified in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility with commercial or closed-source applications is not detailed.

Limitations & Caveats

The project is marked with "TODO" items, indicating ongoing development and potentially missing features or experimental results. Compatibility with transformers versions higher than 4.37.0 requires verification.

Health Check
  • Last commit: 3 weeks ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days
