LLM serving engine extension for reduced TTFT and increased throughput
Top 13.4% on sourcepulse
LMCache is an extension for LLM serving engines, primarily vLLM, designed to reduce Time To First Token (TTFT) and increase throughput, particularly in long-context scenarios. It does this by storing and reusing Key-Value (KV) caches across storage tiers (GPU, CPU DRAM, local disk) and across serving instances, including for non-prefix text segments. This saves GPU compute and lowers response latency in workloads such as multi-round QA and RAG.
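In a RAG or multi-round QA workload, for instance, the same long document or chat history is resent with every request, and with cache reuse only the first request pays the full prefill cost. A minimal sketch that measures this effect against an OpenAI-compatible vLLM endpoint (the URL, model name, and document path are placeholders):

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
long_doc = open("report.txt").read()  # placeholder: a long shared context

def ttft(question: str) -> float:
    """Seconds until the first streamed chunk arrives (a proxy for TTFT)."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server hosts
        messages=[{"role": "user", "content": f"{long_doc}\n\nQ: {question}"}],
        stream=True,
        max_tokens=64,
    )
    for _ in stream:
        return time.perf_counter() - start
    raise RuntimeError("no chunks received")

print("cold TTFT:", ttft("Summarize the document."))
print("warm TTFT:", ttft("List the key risks."))  # shared prefix: cached KV is reused
```

With caching enabled, the second call should show a markedly lower TTFT because the shared document's KV cache is fetched rather than recomputed.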
How It Works
LMCache implements a distributed KV cache storage and retrieval system. It offloads KV caches to CPU DRAM and local disk, and enables peer-to-peer sharing of those caches among serving instances. This disaggregated design lets previously computed KV states be reused instead of recomputed, which matters most when text segments recur across conversations or documents.
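A toy sketch of the idea, not LMCache's actual implementation: KV tensors are cached per fixed-size token chunk and keyed by a content hash, so a matching segment can be reused wherever it appears in a prompt, and cold chunks spill from DRAM to disk (all names below are hypothetical):

```python
import hashlib
from collections import OrderedDict

CHUNK_TOKENS = 256  # chunk size is configurable in LMCache; 256 here is illustrative

def chunk_key(token_chunk: list[int]) -> str:
    # Hash the token IDs of a fixed-size chunk so identical text segments map
    # to the same key regardless of position in the prompt (non-prefix reuse).
    return hashlib.sha256(str(token_chunk).encode()).hexdigest()

class TieredKVStore:
    """Two-tier toy store: hot KV chunks in 'DRAM', overflow spills to 'disk'."""

    def __init__(self, dram_capacity: int = 4):
        self.dram: OrderedDict[str, bytes] = OrderedDict()  # LRU order
        self.disk: dict[str, bytes] = {}
        self.capacity = dram_capacity

    def put(self, key: str, kv_blob: bytes) -> None:
        self.dram[key] = kv_blob
        self.dram.move_to_end(key)
        if len(self.dram) > self.capacity:
            cold_key, cold_blob = self.dram.popitem(last=False)
            self.disk[cold_key] = cold_blob  # offload the coldest chunk

    def get(self, key: str) -> bytes | None:
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.disk:
            self.put(key, self.disk.pop(key))  # promote back into DRAM
            return self.dram[key]
        return None  # miss: the serving engine must recompute this chunk's KV
```

In the real system the blobs are GPU KV tensors moved across memory tiers (and, via peer-to-peer transfer, across serving instances); on a hit the engine skips prefill for that chunk entirely.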
Quick Start & Requirements
pip install vllm
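LMCache is integrated into vLLM (depending on your environment, pip install lmcache may also be needed). Below is a minimal offline sketch that enables it through vLLM's KV-transfer configuration; the connector name, environment variables, and model follow the published integration examples and may differ across releases:

```python
import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache reads its settings from environment variables (values illustrative):
os.environ["LMCACHE_CHUNK_SIZE"] = "256"        # tokens per cached KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"        # offload KV chunks to CPU DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5"  # DRAM cache budget, in GB

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",      # route vLLM's KV cache through LMCache
        kv_role="kv_both",                      # this instance both saves and loads KV
    ),
    gpu_memory_utilization=0.8,
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

The same connector can be passed to vllm serve via --kv-transfer-config to run an online, OpenAI-compatible server.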
Highlighted Details
Maintenance & Community
Contribution guidelines are in CONTRIBUTING.md.
Licensing & Compatibility
Limitations & Caveats
The project is under active development, with multiple published research papers behind it, so it may evolve quickly. Performance gains depend heavily on the workload and the LLM architecture.