List of must-read papers on KV cache compression
This repository is a curated collection of research papers on KV cache compression techniques for Large Language Models (LLMs). It provides a structured overview of methods for improving LLM inference efficiency, aimed at researchers and engineers working on LLM acceleration and memory optimization.
How It Works
The repository categorizes papers into distinct approaches for KV cache compression, including pruning/evicting tokens, merging KV cache entries across layers, low-rank approximations, quantization, and prompt compression. This structured organization allows users to quickly identify and explore specific optimization strategies and their underlying methodologies.
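To make two of these categories concrete, here is a minimal NumPy sketch of KV cache quantization and attention-based token eviction. It is not drawn from any paper in the list; the function names (`quantize_kv`, `evict_tokens`), shapes, and the use of raw attention scores as the eviction signal are all illustrative assumptions.

```python
import numpy as np

def quantize_kv(kv: np.ndarray, num_bits: int = 8):
    """Per-tensor symmetric quantization of a KV cache block.

    Returns integer codes plus the scale needed to dequantize.
    (Illustrative sketch; real methods often quantize per channel or group.)
    """
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 127 for 8-bit
    scale = max(float(np.abs(kv).max()) / qmax, 1e-8)  # guard against all-zero input
    codes = np.clip(np.round(kv / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize_kv(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

def evict_tokens(keys, values, attn_scores, keep: int):
    """Token eviction: retain only the `keep` most-attended cache entries."""
    kept = np.argsort(attn_scores)[-keep:]  # indices of highest-scoring tokens
    kept.sort()                             # preserve original token order
    return keys[kept], values[kept]

# Toy cache: 16 tokens, head dimension 64.
rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64)).astype(np.float32)
values = rng.standard_normal((16, 64)).astype(np.float32)

codes, scale = quantize_kv(keys)
print("mean quantization error:", np.abs(dequantize_kv(codes, scale) - keys).mean())

scores = rng.random(16)  # stand-in for accumulated attention weights
k2, v2 = evict_tokens(keys, values, scores, keep=8)
print("cache shrank from", keys.shape[0], "to", k2.shape[0], "tokens")
```

The papers in the list differ mainly in how they choose the eviction signal (accumulated attention, recency, learned importance) and the quantization granularity; this sketch only shows the mechanical skeleton shared by those families.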
Quick Start & Requirements
This is a curated list of papers and does not involve code execution.
Maintenance & Community
The project is actively maintained, with recent commit activity, and welcomes contributions via Pull Requests.
Licensing & Compatibility
The repository is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
This repository is a literature collection and does not provide implementations or benchmarks of the discussed methods. Users must refer to individual papers for implementation details and performance evaluations.