kvcached by ovg-project

Virtualizing LLM KV cache for elastic GPU sharing

Created 5 months ago · 585 stars · Top 55.4% on SourcePulse

Project Summary

kvcached addresses the challenge of inefficient GPU memory utilization in LLM serving, particularly under dynamic workloads. By introducing a virtual memory abstraction for KV caches, it allows LLM systems to elastically allocate and reclaim GPU memory on demand. This benefits engineers and researchers by enabling flexible GPU sharing, cost savings, and practical deployment of serverless LLMs and compound AI systems on limited hardware.

How It Works

The project implements an OS-style virtual memory system for LLM KV caches, decoupling logical cache addresses from physical GPU memory. Serving engines initially reserve virtual memory, which is backed by physical GPU memory only when actively utilized. This on-demand allocation and runtime mapping strategy allows for dynamic memory management, improving GPU utilization and enabling features like frontend routing and idle model sleep modes.
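
The summary above does not name the underlying mechanism, but on NVIDIA GPUs the standard way to decouple a reserved virtual address range from physical memory is the CUDA driver's virtual memory management API. The sketch below is a minimal illustration of the reserve-once, back-on-demand, reclaim-later pattern the description implies; it is not kvcached's actual code, and error handling is omitted:

    #include <cuda.h>

    int main() {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        // Physical allocations must be multiples of the granularity.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        size_t gran;
        cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);

        // 1. Reserve a large virtual range for the KV cache up front.
        //    This consumes address space, not GPU memory.
        size_t reserved = 64 * gran;
        CUdeviceptr va;
        cuMemAddressReserve(&va, reserved, 0, 0, 0);

        // 2. Back one page with physical memory only when it is needed.
        CUmemGenericAllocationHandle page;
        cuMemCreate(&page, gran, &prop, 0);
        cuMemMap(va, gran, 0, page, 0);
        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(va, gran, &access, 1);

        // ... KV blocks in [va, va + gran) are now usable by kernels ...

        // 3. Reclaim under low load: unmap and release the physical page.
        //    The virtual range stays reserved, so cached pointers remain
        //    valid and the page can be re-backed later.
        cuMemUnmap(va, gran);
        cuMemRelease(page);

        cuMemAddressFree(va, reserved);
        cuCtxDestroy(ctx);
        return 0;
    }

Whether kvcached uses exactly these calls is an implementation detail of the repository; the pattern above is what "decoupling logical cache addresses from physical GPU memory" means in practice on NVIDIA hardware, and it is why a cache can shrink and regrow without moving tensors the engine already holds pointers to.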

Quick Start & Requirements

Installation is available via PyPI (pip install kvcached) or from source. Docker images are provided for SGLang (ghcr.io/ovg-project/kvcached-sglang:latest) and vLLM (ghcr.io/ovg-project/kvcached-vllm:latest). Prerequisites are Python 3.9–3.12 and a supported engine version: SGLang v0.5.3 or vLLM v0.11.0. An NVIDIA GPU is required. Further instructions and examples are available in the repository.
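
For reference, the commands implied above (package name and image tags copied verbatim from the text; verify against the repository's README):

    pip install kvcached
    docker pull ghcr.io/ovg-project/kvcached-sglang:latest   # SGLang variant
    docker pull ghcr.io/ovg-project/kvcached-vllm:latest     # vLLM variant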

Highlighted Details

  • Elastic KV Cache: Dynamically allocates and reclaims KV cache memory based on live load.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 39
  • Issues (30d): 20
  • Star History: 495 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA

Top 0.1% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 1 year ago
Starred by Taranjeet Singh (cofounder of Mem0), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

Top 2.7% on SourcePulse · 6k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago · Updated 18 hours ago