SnapKV by FasterDecoding

KV cache compression research paper

created 1 year ago
268 stars

Top 96.5% on sourcepulse

Project Summary

SnapKV offers a novel KV cache compression method to enhance the efficiency of Large Language Models (LLMs) during inference. It targets researchers and engineers working with LLMs who need to reduce memory footprint and improve speed, particularly for models like Llama, Mistral, and Mixtral.

How It Works

SnapKV employs a compression technique that intelligently identifies and retains salient KV cache entries, discarding less critical information. This approach aims to maintain generation quality while significantly reducing memory usage, enabling longer context windows or larger batch sizes.
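
A minimal sketch of this kind of selection, assuming the approach described in the SnapKV paper (attention from an observation window at the end of the prompt votes for which earlier KV entries to keep, with pooling to cluster neighbours); the function name, tensor shapes, and defaults below are illustrative, not the repository's API:

    import torch
    import torch.nn.functional as F

    def compress_kv(keys, values, queries, window_size=32, max_capacity=256, pool_kernel=7):
        # keys/values/queries: [num_heads, seq_len, head_dim] for a single sequence.
        num_heads, seq_len, head_dim = keys.shape
        if seq_len <= max_capacity:
            return keys, values  # nothing to compress

        # Attention of the last `window_size` prompt queries over the earlier prefix.
        obs_q = queries[:, -window_size:, :]                            # [H, W, D]
        prefix_k = keys[:, : seq_len - window_size, :]                  # [H, P, D]
        scores = obs_q @ prefix_k.transpose(-1, -2) / head_dim ** 0.5   # [H, W, P]
        votes = scores.softmax(dim=-1).sum(dim=1)                       # [H, P]

        # Pool neighbouring positions so the kept tokens form small clusters.
        votes = F.max_pool1d(votes.unsqueeze(1), kernel_size=pool_kernel,
                             stride=1, padding=pool_kernel // 2).squeeze(1)

        # Keep the top-voted prefix positions plus the observation window itself.
        keep = max_capacity - window_size
        top_idx = votes.topk(keep, dim=-1).indices.sort(dim=-1).values  # [H, keep]
        idx = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
        new_keys = torch.cat([keys.gather(1, idx), keys[:, -window_size:, :]], dim=1)
        new_values = torch.cat([values.gather(1, idx), values[:, -window_size:, :]], dim=1)
        return new_keys, new_values  # [H, max_capacity, D]

In this sketch the compression happens once at prefill; KV entries for newly generated tokens are then appended to the compressed cache as usual.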

Quick Start & Requirements

  • Installation: pip install -e . after cloning the repository.
  • Requirements: transformers>=4.36, flash-attn==2.4.0.
  • Usage: monkey-patch existing models (e.g., replace_mistral()) or integrate SnapKV via marked comments within the model code; an example notebook is available (see the sketch after this list).
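
A hedged usage sketch of the monkey-patching route: replace_mistral() is named in the README, but the import path and model checkpoint below are assumptions; consult the example notebook for the exact interface.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from snapkv.monkeypatch.monkeypatch import replace_mistral  # assumed module path

    replace_mistral()  # patch Mistral attention so the KV cache is compressed at prefill

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        attn_implementation="flash_attention_2",  # the repository pins flash-attn==2.4.0
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Summarize the following report: ...", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The patch is applied before from_pretrained() so the modified attention classes are in place when the model is instantiated.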

Highlighted Details

  • Out-of-the-box KV cache compression.
  • Supports Llama, Mistral, and Mixtral families.
  • Focuses on memory reduction and inference speed.

Maintenance & Community

The project appears to be actively developed, with a clear citation provided for its methodology. Further community engagement channels (e.g., Discord, Slack) are not specified in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility with commercial or closed-source applications is not detailed.

Limitations & Caveats

The project is marked with "TODO" items, indicating ongoing development and potentially missing features or experimental results. Compatibility with transformers versions higher than 4.37.0 requires verification.

Health Check
  • Last commit: 3 weeks ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days
