jjiantong / Survey on efficient LLM serving via KV cache optimization
This repository serves as a comprehensive survey of system-aware KV cache optimization techniques for efficient Large Language Model (LLM) serving. It targets researchers and engineers seeking to improve LLM inference performance without modifying model architectures or retraining. The primary benefit is a structured, taxonomy-driven overview of existing methods, their trade-offs, and open research challenges.
How It Works
The survey organizes KV cache optimization strategies into three core behavioral dimensions: Temporal (access/computation timing), Spatial (placement/migration), and Structural (representation/management). This taxonomy facilitates analysis of how these behaviors interact (co-design affinity) and influence key serving objectives like latency, throughput, and memory usage (behavior-objective effects). This systematic approach aims to highlight novel research directions and identify critical open problems in LLM serving efficiency.
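To make the memory objective concrete, the sketch below estimates how quickly the KV cache grows with sequence length and batch size. This is a back-of-the-envelope illustration, not material from the survey; the model configuration (a Llama-2-7B-like setup with 32 layers, 32 KV heads, and head dimension 128) and the function name `kv_cache_bytes` are assumptions for illustration only.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to store keys and values for every token in the batch."""
    # Factor of 2 accounts for storing both the key and the value per head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch_size

# Assumed Llama-2-7B-like configuration: 32 layers, 32 KV heads, head_dim 128,
# fp16 (2 bytes per element), 4096-token context, batch of 8 sequences.
footprint = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=4096, batch_size=8, dtype_bytes=2)
print(f"{footprint / 2**30:.1f} GiB")  # ~16 GiB for this configuration
```

Under these assumptions the cache alone approaches the capacity of a single accelerator, which is why the temporal, spatial, and structural behaviors surveyed here trade off directly against latency, throughput, and memory.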
Quick Start & Requirements
This repository is a curated list of research papers; there is no software to install or run. New papers can be contributed via pull requests or issues.
Maintenance & Community
The survey and repository are under active development and updated regularly; contributions of relevant papers via pull requests or issues are encouraged. Cite the survey paper: Jiang et al., "Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization" [DOI: 10.36227/techrxiv.176046306.66521015/v3].
Licensing & Compatibility
No specific software license is mentioned for the repository itself. The content is a survey of research papers, each with its own licensing implications.
Limitations & Caveats
As a research survey, this repository does not offer a deployable system. Its scope is limited to "system-aware, serving-time, KV-centric optimization methods" that do not require model retraining or architectural changes. The content is continuously evolving due to active development.