Awesome-KV-Cache-Optimization by jjiantong

Survey on efficient LLM serving via KV cache optimization

Created 3 months ago
298 stars

Top 89.4% on SourcePulse

Project Summary

This repository is a comprehensive survey of system-aware KV cache optimization techniques for efficient Large Language Model (LLM) serving. It targets researchers and engineers who want to improve LLM inference performance without retraining models or modifying their architectures. Its primary benefit is a structured, taxonomy-driven overview of existing methods, their trade-offs, and open research challenges.

How It Works

The survey organizes KV cache optimization strategies into three core behavioral dimensions: Temporal (access/computation timing), Spatial (placement/migration), and Structural (representation/management). This taxonomy facilitates analysis of how these behaviors interact (co-design affinity) and influence key serving objectives like latency, throughput, and memory usage (behavior-objective effects). This systematic approach aims to highlight novel research directions and identify critical open problems in LLM serving efficiency.
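
As a concrete illustration of what these behaviors act on, the sketch below (not taken from the repository or the survey) shows a toy KV cache during autoregressive decoding: each step appends the new token's attention key/value vectors to a per-sequence cache so earlier tokens are not re-encoded. The `ToyKVCache` class, dimensions, and the comments marking where Temporal, Spatial, and Structural optimizations would intervene are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyKVCache:
    """Per-sequence store of attention keys/values, grown by one row per decoded token."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Temporal behavior: when entries are computed, reused, or evicted
        # (a real serving system might drop or recompute old rows here).
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(cache, q, k, v):
    """One autoregressive step: attend over all cached K/V instead of recomputing them."""
    # Spatial behavior: where these rows live (GPU HBM, CPU DRAM, SSD) is a placement decision.
    cache.append(k, v)
    # Structural behavior: how rows are represented (precision, sparsity, paging) shapes this matmul.
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.values

# Toy usage: four decode steps with an 8-dim head; the cache grows linearly with sequence
# length, which is exactly the memory/latency pressure the surveyed methods try to tame.
rng = np.random.default_rng(0)
cache = ToyKVCache(head_dim=8)
for _ in range(4):
    q, k, v = rng.normal(size=(3, 8))
    out = decode_step(cache, q, k, v)
print(cache.keys.shape)  # (4, 8)
```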

Quick Start & Requirements

This repository is a curated list of research papers; there is no software to install or run. Contribution instructions explain how to add new papers via pull requests or issues.

Highlighted Details

  • Organizes KV cache optimization methods into a novel "system behavior-oriented taxonomy" encompassing Temporal, Spatial, and Structural dimensions.
  • Provides in-depth analysis of cross-behavior co-design affinities and behavior-objective effects, mapping techniques to serving goals (a tagging sketch follows this list).
  • Features a comprehensive catalog of papers, often linking to their code implementations where available.
  • Actively maintained and updated, welcoming community contributions.
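
To make the behavior-objective mapping concrete, here is a hypothetical sketch of how catalog entries could be tagged with behavioral dimensions and serving objectives. The `TechniqueEntry` class, the example technique names, and their classifications are illustrative assumptions, not the repository's actual schema or the survey's judgments.

```python
from dataclasses import dataclass

@dataclass
class TechniqueEntry:
    """Hypothetical catalog record: one surveyed technique with its tags."""
    name: str
    behaviors: frozenset   # subset of {"temporal", "spatial", "structural"}
    objectives: frozenset  # subset of {"latency", "throughput", "memory"}

# Illustrative entries and classifications (assumptions, not the survey's own tagging).
entries = [
    TechniqueEntry("paged KV allocation", frozenset({"spatial", "structural"}), frozenset({"memory", "throughput"})),
    TechniqueEntry("prefix cache reuse", frozenset({"temporal"}), frozenset({"latency", "throughput"})),
    TechniqueEntry("KV quantization", frozenset({"structural"}), frozenset({"memory"})),
]

# Behavior-objective view: which serving objectives each behavioral dimension touches.
by_behavior: dict[str, set[str]] = {}
for e in entries:
    for b in e.behaviors:
        by_behavior.setdefault(b, set()).update(e.objectives)

for behavior, objectives in sorted(by_behavior.items()):
    print(f"{behavior}: {sorted(objectives)}")
```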

Maintenance & Community

The survey and repository are under active development and are updated regularly. Contributions of relevant papers are encouraged via pull requests or issues. The primary citation is provided for the survey paper: Jiang et al., "Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization" [DOI: 10.36227/techrxiv.176046306.66521015/v3].

Licensing & Compatibility

No specific software license is mentioned for the repository itself. The content is a survey of research papers, each with its own licensing implications.

Limitations & Caveats

As a research survey, this repository does not offer a deployable system. Its scope is limited to "system-aware, serving-time, KV-centric optimization methods" that require neither model retraining nor architectural changes. Because the survey is actively developed, its contents change frequently.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 100 stars in the last 30 days
