LMCache by LMCache

LLM serving engine extension for reduced TTFT and increased throughput

created 1 year ago
3,683 stars

Top 13.4% on sourcepulse

View on GitHub
Project Summary

LMCache is an extension for LLM serving engines, primarily vLLM, designed to reduce Time To First Token (TTFT) and increase throughput, particularly in long-context scenarios. It achieves this by caching and reusing Key-Value (KV) caches across different locations (GPU, CPU DRAM, local disk) and serving instances, even for non-prefix text segments. This approach saves GPU resources and lowers response latency for applications like multi-round QA and RAG.

How It Works

LMCache implements a distributed KV cache storage and retrieval system. It intelligently offloads KV caches to CPU DRAM and local disk, and enables peer-to-peer sharing of these caches among serving instances. This disaggregated approach allows for efficient reuse of previously computed KV states, significantly reducing redundant computations, especially when dealing with repeated or similar text segments within conversations or documents.
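As a mental model only (not LMCache's actual code), the storage hierarchy behaves like a chunk-keyed cache that spills from GPU memory to CPU DRAM and then to local disk, and promotes entries back up on reuse; keying on a hash of the token chunk, rather than only on request prefixes, is what lets repeated non-prefix segments hit. A minimal sketch, with every name hypothetical:

    import hashlib
    import pickle
    from collections import OrderedDict
    from pathlib import Path

    class TieredKVCache:
        """Illustrative three-tier KV cache: GPU -> CPU DRAM -> local disk."""

        def __init__(self, gpu_capacity=8, cpu_capacity=64, disk_dir="/tmp/kv_cache"):
            self.gpu = OrderedDict()          # hottest entries (GPU tensors in practice)
            self.cpu = OrderedDict()          # entries offloaded to CPU DRAM
            self.gpu_capacity = gpu_capacity
            self.cpu_capacity = cpu_capacity
            self.disk_dir = Path(disk_dir)    # coldest entries spill to local disk
            self.disk_dir.mkdir(parents=True, exist_ok=True)

        @staticmethod
        def chunk_key(token_ids):
            # Key on a hash of the chunk itself, so a repeated chunk can hit even
            # when it is not a prefix of the current request.
            return hashlib.sha256(repr(token_ids).encode()).hexdigest()

        def put(self, token_ids, kv_tensors):
            self.gpu[self.chunk_key(token_ids)] = kv_tensors
            self._spill()

        def get(self, token_ids):
            key = self.chunk_key(token_ids)
            if key in self.gpu:                       # hit in GPU memory
                self.gpu.move_to_end(key)
                return self.gpu[key]
            if key in self.cpu:                       # promote from CPU DRAM
                kv = self.cpu.pop(key)
                self.put(token_ids, kv)
                return kv
            path = self.disk_dir / key
            if path.exists():                         # promote from local disk
                kv = pickle.loads(path.read_bytes())
                self.put(token_ids, kv)
                return kv
            return None                               # miss: the engine must re-prefill

        def _spill(self):
            while len(self.gpu) > self.gpu_capacity:  # evict GPU -> CPU in LRU order
                key, kv = self.gpu.popitem(last=False)
                self.cpu[key] = kv
            while len(self.cpu) > self.cpu_capacity:  # evict CPU -> disk
                key, kv = self.cpu.popitem(last=False)
                (self.disk_dir / key).write_bytes(pickle.dumps(kv))

In this picture, a cache miss means the serving engine computes the KV for the chunk once and stores it with put(), and later requests containing the same chunk retrieve it with get() from whichever tier currently holds it, trading a copy across tiers for a full prefill recomputation.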

Quick Start & Requirements

  • Install via pip install lmcache; LMCache runs as an extension of vLLM, so a working vLLM installation is required (a minimal integration sketch follows this list).
  • For pre-built images, refer to the vLLM Docker images.
  • Detailed documentation is available online.
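For orientation, here is a hedged offline-inference sketch of wiring LMCache into vLLM through its KV-connector interface. The connector name (LMCacheConnectorV1), the KVTransferConfig fields, and the model name are assumptions taken from LMCache's published examples and may differ between versions; treat the official documentation as the authoritative reference.

    # Hedged sketch: connector name and config fields are assumptions based on
    # LMCache's published vLLM examples and may change between releases.
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    # Route KV-cache traffic through LMCache; "kv_both" means this instance
    # both stores and retrieves KV caches.
    ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any vLLM-supported model
        kv_transfer_config=ktc,
        gpu_memory_utilization=0.8,
    )

    # Requests that share long context segments should see lower TTFT on reuse,
    # since the shared KV caches are fetched instead of recomputed.
    outputs = llm.generate(
        ["<long shared document> Question 1: ...",
         "<long shared document> Question 2: ..."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    for out in outputs:
        print(out.outputs[0].text)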

Highlighted Details

  • Achieves 3-10x savings in response delay (TTFT) and GPU cycles in many long-context use cases when combined with vLLM.
  • Supports CPU KVCache offloading, disaggregated prefill, and P2P KVCache sharing (see the conceptual sketch after this list).
  • Stable support for non-prefix KV caches.
  • Integrated into the vLLM production stack ecosystem.
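To illustrate the disaggregated-prefill idea only (this is not LMCache's API or transport), the toy sketch below splits prefill and decode into separate processes that hand off KV caches through a shared store, so the decode side never recomputes the prompt:

    import multiprocessing as mp

    def prefill_worker(requests, kv_store, ready_q):
        # Prefill side: compute each prompt's KV cache once and publish it.
        for req_id, prompt_tokens in requests:
            kv_store[req_id] = {"n_tokens": len(prompt_tokens),
                                "kv": f"<kv blob for {len(prompt_tokens)} tokens>"}
            ready_q.put(req_id)
        ready_q.put(None)  # tell the decode side we are done

    def decode_worker(kv_store, ready_q):
        # Decode side: pick up published KV caches and generate without re-prefilling.
        while (req_id := ready_q.get()) is not None:
            kv = kv_store[req_id]
            print(f"decode {req_id}: reusing KV for {kv['n_tokens']} prompt tokens")

    if __name__ == "__main__":
        requests = [("req-1", list(range(4096))), ("req-2", list(range(8192)))]
        with mp.Manager() as mgr:
            kv_store, ready_q = mgr.dict(), mgr.Queue()
            prefill = mp.Process(target=prefill_worker, args=(requests, kv_store, ready_q))
            decode = mp.Process(target=decode_worker, args=(kv_store, ready_q))
            prefill.start(); decode.start()
            prefill.join(); decode.join()

In a real deployment the two roles would run on different machines, with LMCache (or a P2P transfer between instances) taking the place of the shared dictionary.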

Maintenance & Community

  • Community meetings are held weekly on Tuesdays, alternating between 9:00 AM PT and 6:30 PM PT.
  • Contributions are welcomed; see CONTRIBUTING.md.
  • Join the community via Slack.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, with several associated research papers, so interfaces and behavior may evolve quickly. Reported performance gains depend heavily on the workload (in particular, how much KV cache reuse it exposes) and on the model and serving configuration.

Health Check

  • Last commit: 17 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 161
  • Issues (30d): 104
  • Star history: 2,934 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Inference framework for distributed generative AI model serving

  • 5k stars (top 1.1% on sourcepulse), created 5 months ago, updated 17 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

  • 54k stars (top 1.0% on sourcepulse), created 2 years ago, updated 14 hours ago