Mooncake by kvcache-ai

Research paper on a disaggregated architecture for LLM serving

created 1 year ago
3,675 stars

Top 13.5% on sourcepulse

Project Summary

Mooncake is an LLM serving platform designed for efficient long-context inference by disaggregating prefill and decoding clusters and leveraging a KVCache-centric architecture. It targets LLM service providers and researchers seeking to maximize throughput and meet latency SLOs, particularly in demanding, overloaded scenarios.

How It Works

Mooncake employs a disaggregated architecture that separates prefill and decoding clusters and repurposes underutilized CPU, DRAM, and SSD resources as a disaggregated KVCache. Its core innovation is a KVCache-centric scheduler that balances throughput maximization against latency SLOs, paired with a prediction-based early rejection policy that sheds requests up front under overload.
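
The early rejection idea can be illustrated with a toy sketch (class names, constants, and the load model here are hypothetical, not Mooncake's actual scheduler): estimate the time-to-first-token (TTFT) a new request would see given the work already queued, and reject it immediately if the estimate would violate the SLO, rather than queueing doomed work.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int


class EarlyRejectScheduler:
    """Toy prediction-based early rejection: admit a request only if its
    predicted TTFT stays within the SLO. All numbers are illustrative,
    not taken from Mooncake."""

    def __init__(self, prefill_tokens_per_s: float, ttft_slo_s: float):
        self.prefill_tokens_per_s = prefill_tokens_per_s
        self.ttft_slo_s = ttft_slo_s
        self.queued_tokens = 0  # prompt tokens already waiting for prefill

    def predict_ttft(self, req: Request) -> float:
        # Queued work must drain before this request's prefill completes.
        return (self.queued_tokens + req.prompt_tokens) / self.prefill_tokens_per_s

    def try_admit(self, req: Request) -> bool:
        if self.predict_ttft(req) > self.ttft_slo_s:
            return False  # reject early instead of missing the SLO later
        self.queued_tokens += req.prompt_tokens
        return True


sched = EarlyRejectScheduler(prefill_tokens_per_s=10_000, ttft_slo_s=2.0)
print(sched.try_admit(Request(prompt_tokens=8_000)))   # predicted 0.8 s: admit
print(sched.try_admit(Request(prompt_tokens=15_000)))  # predicted 2.3 s: reject
```

A real scheduler would also account for decoding-cluster load and KVCache reuse; this sketch captures only the admit-or-reject decision.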

Quick Start & Requirements

  • Install: pip install mooncake-transfer-engine
  • Prerequisites: RDMA driver & SDK (e.g., Mellanox OFED), Python 3.10+, CUDA 12.1+ (with GPUDirect Storage support if building with -DUSE_CUDA).
  • Build Dependencies: GCC 9.4+, CMake 3.16+, Go 1.20+ (for P2P Store/etcd), Rust (optional), hiredis (optional), curl (optional).
  • Setup: RDMA is recommended for optimal performance; Docker deployment is supported.
  • Docs: Transfer Engine Guide, P2P Store Guide, Mooncake Store Guide, vLLM Integration Guide v0.2
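
A minimal setup fragment under the prerequisites above (assumes a host with RDMA drivers already installed; the `ibv_devices`/`ibstat` checks are standard OFED utilities, not Mooncake tooling):

```shell
# Install the prebuilt Transfer Engine wheel (Python 3.10+ required)
pip install mooncake-transfer-engine

# Sanity-check the RDMA environment before benchmarking
ibv_devices   # list RDMA-capable NICs
ibstat        # verify the link state is Active
```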

Highlighted Details

  • Achieves up to 525% throughput increase in simulated long-context scenarios compared to baselines.
  • Transfer Engine offers high-performance data transfer over TCP, RDMA, and NVMe-oF, with up to 4.6x faster bandwidth than TCP on high-speed networks.
  • vLLM integration with Transfer Engine shows up to 25% lower Mean TTFT than TCP-based transports.
  • P2P Store enables efficient, decentralized sharing of temporary objects like checkpoints.
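
The P2P Store's decentralized sharing can be sketched in miniature (the class and method names are hypothetical, not the actual P2P Store API): a metadata registry maps each object key to the peers holding a replica, and a fetching peer pulls from any holder and then registers itself as an additional source, so popular objects like checkpoints spread without a central file server.

```python
class P2PStoreSketch:
    """Toy decentralized object sharing: a registry maps each key to the
    set of peers holding a replica; fetchers become replicas themselves.
    Purely illustrative, not the P2P Store API."""

    def __init__(self):
        self.registry: dict[str, set[str]] = {}           # key -> holders
        self.peer_data: dict[str, dict[str, bytes]] = {}  # peer -> objects

    def put(self, peer: str, key: str, blob: bytes) -> None:
        self.peer_data.setdefault(peer, {})[key] = blob
        self.registry.setdefault(key, set()).add(peer)

    def get(self, peer: str, key: str) -> bytes:
        holders = self.registry.get(key)
        if not holders:
            raise KeyError(key)
        source = next(iter(holders))        # any holder will do
        blob = self.peer_data[source][key]  # "transfer" from that peer
        self.put(peer, key, blob)           # fetcher now serves it too
        return blob


store = P2PStoreSketch()
store.put("node-a", "ckpt-0001", b"checkpoint bytes")
store.get("node-b", "ckpt-0001")            # node-b pulls and re-shares
print(sorted(store.registry["ckpt-0001"]))  # ['node-a', 'node-b']
```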

Maintenance & Community

  • Open-sourced components include Transfer Engine and Mooncake Store.
  • Integrations with vLLM and SGLang are available.
  • Awarded Best Paper at FAST 2025.
  • Open Source Trace available.

Licensing & Compatibility

  • The repository does not state a license for the core project or its components: the README mentions open-sourced components, but no LICENSE file is present.

Limitations & Caveats

  • Optimal performance is heavily dependent on RDMA network setup.
  • The project's licensing status is unclear, which may impact commercial adoption.
  • Some features, like the new vLLM integration with Mooncake Store, are noted as "coming soon."
Health Check

  • Last commit: 15 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 98
  • Issues (30d): 30
  • Star history: 529 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Top 1.1% · 5k stars
Inference framework for distributed generative AI model serving
created 5 months ago · updated 17 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

Top 1.0% · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago · updated 14 hours ago