kvcached by ovg-project

Virtualizing LLM KV cache for elastic GPU sharing

Created 7 months ago
743 stars

Top 46.8% on SourcePulse

View on GitHub
Project Summary

kvcached addresses the challenge of inefficient GPU memory utilization in LLM serving, particularly under dynamic workloads. By introducing a virtual memory abstraction for KV caches, it allows LLM systems to elastically allocate and reclaim GPU memory on demand. This benefits engineers and researchers by enabling flexible GPU sharing, cost savings, and practical deployment of serverless LLMs and compound AI systems on limited hardware.

How It Works

The project implements an OS-style virtual memory system for LLM KV caches, decoupling logical cache addresses from physical GPU memory. Serving engines initially reserve virtual memory, which is backed by physical GPU memory only when actively utilized. This on-demand allocation and runtime mapping strategy allows for dynamic memory management, improving GPU utilization and enabling features like frontend routing and idle model sleep modes.
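Conceptually, this resembles OS demand paging applied to the KV cache. The toy Python sketch below is illustrative only: the class and method names are hypothetical and are not kvcached's actual API. It models a cache that reserves a large virtual page range up front, attaches physical pages lazily on first write, and returns idle pages to a shared pool so another model can use them.

```python
# Toy model of the kvcached idea: logical (virtual) KV-cache pages are
# backed by physical GPU pages only on demand. Purely illustrative;
# all names here are hypothetical, not kvcached's real API.

class ElasticKVCache:
    def __init__(self, virtual_pages: int, page_tokens: int = 16):
        self.page_tokens = page_tokens
        # Reserve a large *virtual* range up front: no physical memory yet.
        self.page_table = [None] * virtual_pages   # virtual page -> physical page
        self.free_physical: list[int] = []         # recycled physical pages
        self.next_physical = 0                     # grows only when needed

    def _map(self, vpage: int) -> int:
        """Back a virtual page with a physical page on first touch."""
        if self.page_table[vpage] is None:
            ppage = self.free_physical.pop() if self.free_physical else self._grow()
            self.page_table[vpage] = ppage
        return self.page_table[vpage]

    def _grow(self) -> int:
        ppage = self.next_physical
        self.next_physical += 1
        return ppage

    def write_token(self, token_idx: int) -> int:
        # Writing a token's KV entry touches its page, mapping it if needed.
        return self._map(token_idx // self.page_tokens)

    def reclaim(self, vpage: int) -> None:
        """Unmap an idle page so its physical memory can serve another model."""
        ppage = self.page_table[vpage]
        if ppage is not None:
            self.page_table[vpage] = None
            self.free_physical.append(ppage)


cache = ElasticKVCache(virtual_pages=1_000_000)  # large virtual reservation
cache.write_token(0)   # first touch: one physical page is mapped
cache.write_token(1)   # same page, no new physical memory consumed
cache.reclaim(0)       # idle: physical page returned to the shared pool
```

The key property the sketch captures is that the virtual reservation costs nothing until pages are touched, so a serving engine can size its cache for peak load while physical GPU memory tracks actual live load.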

Quick Start & Requirements

Installation is available via PyPI (pip install kvcached) or from source. Docker images are provided for SGLang (ghcr.io/ovg-project/kvcached-sglang:latest) and vLLM (ghcr.io/ovg-project/kvcached-vllm:latest). Prerequisites include Python 3.9–3.12 and a compatible version of SGLang (v0.5.3) or vLLM (v0.11.0). An NVIDIA GPU is required. Further instructions and examples are available in the repository.
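For reference, the package name and image tags listed above can be pulled directly (check the repository for engine-specific launch instructions):

```
pip install kvcached
docker pull ghcr.io/ovg-project/kvcached-sglang:latest
docker pull ghcr.io/ovg-project/kvcached-vllm:latest
```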

Highlighted Details

  • Elastic KV Cache: Dynamically allocates and reclaims KV cache memory based on live load.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 7
Issues (30d): 3
Star History: 37 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0.1% · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago
Updated 2 years ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

Top 0.7% · 7k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 17 hours ago