kvcached (ovg-project): Virtualizing LLM KV cache for elastic GPU sharing
kvcached addresses inefficient GPU memory utilization in LLM serving, particularly under dynamic workloads. By introducing a virtual memory abstraction for KV caches, it lets LLM serving systems elastically allocate and reclaim GPU memory on demand. For engineers and researchers, this enables flexible GPU sharing, cost savings, and practical deployment of serverless LLMs and compound AI systems on limited hardware.
How It Works
The project implements an OS-style virtual memory system for LLM KV caches, decoupling logical cache addresses from physical GPU memory. A serving engine reserves virtual memory up front, and physical GPU memory is mapped to back it only when a region is actively in use. This on-demand allocation and runtime mapping improve GPU utilization and enable features such as frontend routing and idle-model sleep modes.
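To make the decoupling concrete, here is a minimal sketch of the bookkeeping involved, not kvcached's actual implementation: the ElasticKVCache class and its methods are hypothetical, and a real system like kvcached maps pages with GPU virtual-memory primitives rather than allocating per-block tensors.

```python
import torch

# Toy page table: logical KV-cache blocks exist from the start, but physical
# buffers are attached lazily and can be reclaimed while the logical address
# space stays valid. Hypothetical names, for illustration only.
class ElasticKVCache:
    def __init__(self, num_blocks, block_shape, dtype=torch.float16, device="cpu"):
        self.block_shape = block_shape
        self.dtype = dtype
        self.device = device
        # "Virtual" space: every logical block starts unmapped (None).
        self.page_table = [None] * num_blocks

    def write(self, block_id, data):
        # Back the logical block with physical memory only on first use.
        if self.page_table[block_id] is None:
            self.page_table[block_id] = torch.empty(
                self.block_shape, dtype=self.dtype, device=self.device
            )
        self.page_table[block_id].copy_(data)

    def reclaim(self, block_id):
        # Unmap: release the physical buffer (e.g., when the model idles)
        # without invalidating the logical block ID.
        self.page_table[block_id] = None

device = "cuda" if torch.cuda.is_available() else "cpu"
cache = ElasticKVCache(num_blocks=1024, block_shape=(16, 128), device=device)
cache.write(0, torch.zeros(16, 128, dtype=torch.float16, device=device))
cache.reclaim(0)  # physical memory returns to the pool; block 0 can be remapped later
```

The key design point the sketch mirrors is that reclaiming a block frees physical memory for other tenants (another model or request) without tearing down the cache's logical layout.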
Quick Start & Requirements
Installation is available via PyPI (pip install kvcached) or from source. Docker images are provided for SGLang (ghcr.io/ovg-project/kvcached-sglang:latest) and vLLM (ghcr.io/ovg-project/kvcached-vllm:latest). Prerequisites are Python 3.9-3.12 and a compatible SGLang (v0.5.3) or vLLM (v0.11.0) release; an NVIDIA GPU is required. Further instructions and examples are available in the repository.
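As a quick-start sketch, the install paths named above correspond to commands like the following (the image tags are those listed here; check the repository for current ones):

```sh
pip install kvcached                                    # PyPI install
docker pull ghcr.io/ovg-project/kvcached-sglang:latest  # SGLang image
docker pull ghcr.io/ovg-project/kvcached-vllm:latest    # vLLM image
```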
Highlighted Details
Related projects: S-LoRA, FMInference, LMCache