LMCache by LMCache

LLM serving engine extension for reduced TTFT and increased throughput

Created 1 year ago
6,666 stars

Top 7.6% on SourcePulse

View on GitHub
Project Summary

LMCache is an extension for LLM serving engines, primarily vLLM, designed to reduce Time To First Token (TTFT) and increase throughput, particularly in long-context scenarios. It does this by storing and reusing Key-Value (KV) caches across multiple tiers (GPU, CPU DRAM, local disk) and across serving instances, even for non-prefix text segments. Reusing KV states for any repeated text saves GPU cycles and lowers response latency in applications such as multi-round QA and RAG.

How It Works

LMCache implements a distributed KV cache storage and retrieval system. It intelligently offloads KV caches to CPU DRAM and local disk, and enables peer-to-peer sharing of these caches among serving instances. This disaggregated approach allows for efficient reuse of previously computed KV states, significantly reducing redundant computations, especially when dealing with repeated or similar text segments within conversations or documents.
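
The sketch below is purely illustrative and is not LMCache's actual API; all class and method names are hypothetical. It only shows the idea described above: a tiered KV-cache store that checks faster tiers before slower ones and reuses a chunk's KV states on a hit instead of recomputing them.

```python
# Illustrative sketch only: NOT LMCache's real API. It mimics a tiered
# KV-cache store (GPU -> CPU DRAM -> local disk) keyed by token-chunk hashes.
from typing import Optional


class TieredKVCache:
    """Hypothetical multi-tier KV cache keyed by a hash of a token chunk."""

    def __init__(self):
        self.gpu: dict[str, bytes] = {}    # hottest tier, smallest capacity
        self.cpu: dict[str, bytes] = {}    # CPU DRAM offload
        self.disk: dict[str, bytes] = {}   # local disk (stand-in for real files)

    def get(self, chunk_hash: str) -> Optional[bytes]:
        # Check tiers from fastest to slowest; promote hits back toward the GPU.
        for tier in (self.gpu, self.cpu, self.disk):
            if chunk_hash in tier:
                kv = tier[chunk_hash]
                self.gpu[chunk_hash] = kv  # promote so the next hit is cheap
                return kv
        return None  # cache miss: the serving engine must recompute this chunk

    def put(self, chunk_hash: str, kv: bytes) -> None:
        # Newly computed KV states land on the GPU tier; eviction policy omitted.
        self.gpu[chunk_hash] = kv


cache = TieredKVCache()
cache.put("hash-of-chunk-0", b"kv-tensor-bytes")
print(cache.get("hash-of-chunk-0") is not None)  # True: reused, not recomputed
```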

Quick Start & Requirements

  • Install via pip install vllm (LMCache is integrated into vLLM).
  • Requires vLLM integration; pre-built Docker images are available from the vLLM project. A minimal launch sketch follows this list.
  • Detailed documentation is available online.
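
The following is a minimal sketch of launching vLLM with LMCache as its KV connector, assuming lmcache and a recent vLLM are installed. The KVTransferConfig import path, the LMCacheConnectorV1 connector name, and the model name are assumptions drawn from the LMCache/vLLM docs and may differ between versions; check the official documentation before use.

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # import path may vary by vLLM version

# Route vLLM's KV caches through LMCache (connector name per LMCache docs;
# treat the exact spelling as an assumption).
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

# Model name is a placeholder; any model supported by vLLM should work.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)

prompt = "Summarize the benefits of KV-cache reuse in one paragraph."
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```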

Highlighted Details

  • Reports 3-10x delay savings and GPU cycle reduction when combined with vLLM.
  • Supports CPU KVCache offloading, disaggregated prefill, and P2P KVCache sharing; a config sketch follows this list.
  • Stable support for non-prefix KV caches.
  • Integrated into the vLLM production stack ecosystem.
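
As a rough sketch of how the offloading tiers might be configured, the snippet below writes an LMCache-style YAML config and points LMCache at it via an environment variable. The field names (chunk_size, local_cpu, max_local_cpu_size, local_disk, max_local_disk_size) and the LMCACHE_CONFIG_FILE variable are assumptions based on LMCache's documentation; verify them against the version you deploy.

```python
import os
import textwrap

# Hypothetical offload settings written as an LMCache-style YAML config file.
# Field names and the env-var name are assumptions; confirm in the LMCache docs.
config = textwrap.dedent("""\
    chunk_size: 256                       # tokens per cached KV chunk
    local_cpu: true                       # offload KV chunks to CPU DRAM
    max_local_cpu_size: 5.0               # DRAM budget in GB
    local_disk: "file:///tmp/lmcache/"    # spill further chunks to local disk
    max_local_disk_size: 20.0             # disk budget in GB
""")

with open("lmcache_config.yaml", "w") as f:
    f.write(config)

# LMCache is commonly pointed at a config file via an environment variable;
# the variable name below is an assumption taken from the docs.
os.environ["LMCACHE_CONFIG_FILE"] = os.path.abspath("lmcache_config.yaml")
```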

Maintenance & Community

  • Community meetings are held on Tuesdays, alternating weekly between 9:00 AM PT and 6:30 PM PT.
  • Contributions are welcomed; see CONTRIBUTING.md.
  • Join the community via Slack.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, with multiple research papers behind it, so features and APIs may evolve quickly. Actual performance gains depend heavily on the workload and the LLM architecture.

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 126
  • Issues (30d): 149
  • Star History: 346 stars in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Johannes Hagemann (cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

S-LoRA by S-LoRA

  • 0.1% · 2k stars
  • System for scalable LoRA adapter serving
  • Created 2 years ago · Updated 2 years ago