kvcached by ovg-project

Virtualizing LLM KV cache for elastic GPU sharing

Created 5 months ago · 585 stars · Top 55.4% on SourcePulse

Project Summary

kvcached addresses the challenge of inefficient GPU memory utilization in LLM serving, particularly under dynamic workloads. By introducing a virtual memory abstraction for KV caches, it allows LLM systems to elastically allocate and reclaim GPU memory on demand. This benefits engineers and researchers by enabling flexible GPU sharing, cost savings, and practical deployment of serverless LLMs and compound AI systems on limited hardware.

How It Works

The project implements an OS-style virtual memory system for LLM KV caches, decoupling logical cache addresses from physical GPU memory. Serving engines initially reserve virtual memory, which is backed by physical GPU memory only when actively utilized. This on-demand allocation and runtime mapping strategy allows for dynamic memory management, improving GPU utilization and enabling features like frontend routing and idle model sleep modes.
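
The summary above does not name the underlying mechanism, but on NVIDIA GPUs the standard way to decouple a reserved virtual address range from physical memory is the CUDA driver's virtual memory management API. The sketch below is a minimal illustration of the reserve-once, back-on-demand, reclaim-later pattern the description implies; it is not kvcached's actual code, and error handling is omitted:

    #include <cuda.h>

    int main() {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        // Physical allocations must be multiples of the granularity.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        size_t gran;
        cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);

        // 1. Reserve a large virtual range for the KV cache up front.
        //    This consumes address space, not GPU memory.
        size_t reserved = 64 * gran;
        CUdeviceptr va;
        cuMemAddressReserve(&va, reserved, 0, 0, 0);

        // 2. Back one page with physical memory only when it is needed.
        CUmemGenericAllocationHandle page;
        cuMemCreate(&page, gran, &prop, 0);
        cuMemMap(va, gran, 0, page, 0);
        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(va, gran, &access, 1);

        // ... KV blocks in [va, va + gran) are now usable by kernels ...

        // 3. Reclaim under low load: unmap and release the physical page.
        //    The virtual range stays reserved, so cached pointers remain
        //    valid and the page can be re-backed later.
        cuMemUnmap(va, gran);
        cuMemRelease(page);

        cuMemAddressFree(va, reserved);
        cuCtxDestroy(ctx);
        return 0;
    }

Whether kvcached uses exactly these calls is an implementation detail of the repository; the pattern above is what "decoupling logical cache addresses from physical GPU memory" means in practice on NVIDIA hardware, and it is why a cache can shrink and regrow without moving tensors the engine already holds pointers to.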

Quick Start & Requirements

Installation is available via PyPI (pip install kvcached) or from source. Docker images are provided for SGLang (ghcr.io/ovg-project/kvcached-sglang:latest) and vLLM (ghcr.io/ovg-project/kvcached-vllm:latest). Prerequisites are Python 3.9–3.12 and a supported engine version: SGLang v0.5.3 or vLLM v0.11.0. An NVIDIA GPU is required. Further instructions and examples are available in the repository.
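
For reference, the commands implied above (package name and image tags copied verbatim from the text; verify against the repository's README):

    pip install kvcached
    docker pull ghcr.io/ovg-project/kvcached-sglang:latest   # SGLang variant
    docker pull ghcr.io/ovg-project/kvcached-vllm:latest     # vLLM variant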

Highlighted Details

  • Elastic KV Cache: Dynamically allocates and reclaims KV cache memory based on live load.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 39
  • Issues (30d): 20
  • Star History: 495 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA

Top 0.1% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 1 year ago
Starred by Taranjeet Singh (cofounder of Mem0), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

Top 2.7% on SourcePulse · 6k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago · Updated 18 hours ago