LMCache by LMCache

LLM serving engine extension for reduced TTFT and increased throughput

created 1 year ago
3,683 stars

Top 13.4% on sourcepulse

View on GitHub
Project Summary

LMCache is an extension for LLM serving engines, primarily vLLM, designed to reduce Time To First Token (TTFT) and increase throughput, particularly in long-context scenarios. It achieves this by caching and reusing Key-Value (KV) caches across different locations (GPU, CPU DRAM, local disk) and serving instances, even for non-prefix text segments. This approach saves GPU resources and lowers response latency for applications like multi-round QA and RAG.

How It Works

LMCache implements a distributed KV cache storage and retrieval system. It intelligently offloads KV caches to CPU DRAM and local disk, and enables peer-to-peer sharing of these caches among serving instances. This disaggregated approach allows for efficient reuse of previously computed KV states, significantly reducing redundant computations, especially when dealing with repeated or similar text segments within conversations or documents.
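As a mental model only (not LMCache's actual code), the storage hierarchy behaves like a chunk-keyed cache that spills from GPU memory to CPU DRAM and then to local disk, and promotes entries back up on reuse; keying on a hash of the token chunk, rather than only on request prefixes, is what lets repeated non-prefix segments hit. A minimal sketch, with every name hypothetical:

    import hashlib
    import pickle
    from collections import OrderedDict
    from pathlib import Path

    class TieredKVCache:
        """Illustrative three-tier KV cache: GPU -> CPU DRAM -> local disk."""

        def __init__(self, gpu_capacity=8, cpu_capacity=64, disk_dir="/tmp/kv_cache"):
            self.gpu = OrderedDict()          # hottest entries (GPU tensors in practice)
            self.cpu = OrderedDict()          # entries offloaded to CPU DRAM
            self.gpu_capacity = gpu_capacity
            self.cpu_capacity = cpu_capacity
            self.disk_dir = Path(disk_dir)    # coldest entries spill to local disk
            self.disk_dir.mkdir(parents=True, exist_ok=True)

        @staticmethod
        def chunk_key(token_ids):
            # Key on a hash of the chunk itself, so a repeated chunk can hit even
            # when it is not a prefix of the current request.
            return hashlib.sha256(repr(token_ids).encode()).hexdigest()

        def put(self, token_ids, kv_tensors):
            self.gpu[self.chunk_key(token_ids)] = kv_tensors
            self._spill()

        def get(self, token_ids):
            key = self.chunk_key(token_ids)
            if key in self.gpu:                       # hit in GPU memory
                self.gpu.move_to_end(key)
                return self.gpu[key]
            if key in self.cpu:                       # promote from CPU DRAM
                kv = self.cpu.pop(key)
                self.put(token_ids, kv)
                return kv
            path = self.disk_dir / key
            if path.exists():                         # promote from local disk
                kv = pickle.loads(path.read_bytes())
                self.put(token_ids, kv)
                return kv
            return None                               # miss: the engine must re-prefill

        def _spill(self):
            while len(self.gpu) > self.gpu_capacity:  # evict GPU -> CPU in LRU order
                key, kv = self.gpu.popitem(last=False)
                self.cpu[key] = kv
            while len(self.cpu) > self.cpu_capacity:  # evict CPU -> disk
                key, kv = self.cpu.popitem(last=False)
                (self.disk_dir / key).write_bytes(pickle.dumps(kv))

In this picture, a cache miss means the serving engine computes the KV for the chunk once and stores it with put(), and later requests containing the same chunk retrieve it with get() from whichever tier currently holds it, trading a copy across tiers for a full prefill recomputation.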

Quick Start & Requirements

  • Install via pip install lmcache; LMCache runs as an extension of vLLM, so a working vLLM installation is required (a minimal integration sketch follows this list).
  • For pre-built images, refer to the vLLM Docker images.
  • Detailed documentation is available online.
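For orientation, here is a hedged offline-inference sketch of wiring LMCache into vLLM through its KV-connector interface. The connector name (LMCacheConnectorV1), the KVTransferConfig fields, and the model name are assumptions taken from LMCache's published examples and may differ between versions; treat the official documentation as the authoritative reference.

    # Hedged sketch: connector name and config fields are assumptions based on
    # LMCache's published vLLM examples and may change between releases.
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    # Route KV-cache traffic through LMCache; "kv_both" means this instance
    # both stores and retrieves KV caches.
    ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any vLLM-supported model
        kv_transfer_config=ktc,
        gpu_memory_utilization=0.8,
    )

    # Requests that share long context segments should see lower TTFT on reuse,
    # since the shared KV caches are fetched instead of recomputed.
    outputs = llm.generate(
        ["<long shared document> Question 1: ...",
         "<long shared document> Question 2: ..."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    for out in outputs:
        print(out.outputs[0].text)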

Highlighted Details

  • Achieves 3-10x savings in response delay (TTFT) and GPU cycles in many long-context use cases when combined with vLLM.
  • Supports CPU KVCache offloading, disaggregated prefill, and P2P KVCache sharing (see the conceptual sketch after this list).
  • Stable support for non-prefix KV caches.
  • Integrated into the vLLM production stack ecosystem.
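To illustrate the disaggregated-prefill idea only (this is not LMCache's API or transport), the toy sketch below splits prefill and decode into separate processes that hand off KV caches through a shared store, so the decode side never recomputes the prompt:

    import multiprocessing as mp

    def prefill_worker(requests, kv_store, ready_q):
        # Prefill side: compute each prompt's KV cache once and publish it.
        for req_id, prompt_tokens in requests:
            kv_store[req_id] = {"n_tokens": len(prompt_tokens),
                                "kv": f"<kv blob for {len(prompt_tokens)} tokens>"}
            ready_q.put(req_id)
        ready_q.put(None)  # tell the decode side we are done

    def decode_worker(kv_store, ready_q):
        # Decode side: pick up published KV caches and generate without re-prefilling.
        while (req_id := ready_q.get()) is not None:
            kv = kv_store[req_id]
            print(f"decode {req_id}: reusing KV for {kv['n_tokens']} prompt tokens")

    if __name__ == "__main__":
        requests = [("req-1", list(range(4096))), ("req-2", list(range(8192)))]
        with mp.Manager() as mgr:
            kv_store, ready_q = mgr.dict(), mgr.Queue()
            prefill = mp.Process(target=prefill_worker, args=(requests, kv_store, ready_q))
            decode = mp.Process(target=decode_worker, args=(kv_store, ready_q))
            prefill.start(); decode.start()
            prefill.join(); decode.join()

In a real deployment the two roles would run on different machines, with LMCache (or a P2P transfer between instances) taking the place of the shared dictionary.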

Maintenance & Community

  • Community meetings are held weekly on Tuesdays, alternating between 9:00 AM PT and 6:30 PM PT.
  • Contributions are welcomed; see CONTRIBUTING.md.
  • Join the community via Slack.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, with several associated research papers, so interfaces and behavior may evolve quickly. Reported performance gains depend heavily on the workload (in particular, how much KV cache reuse it exposes) and on the model and serving configuration.

Health Check

  • Last commit: 17 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 161
  • Issues (30d): 104
  • Star history: 2,934 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Inference framework for distributed generative AI model serving

  • 5k stars (top 1.1% on sourcepulse), created 5 months ago, updated 17 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

  • 54k stars (top 1.0% on sourcepulse), created 2 years ago, updated 14 hours ago