Mooncake is an LLM serving platform designed for efficient long-context inference by disaggregating prefill and decoding clusters and leveraging a KVCache-centric architecture. It targets LLM service providers and researchers seeking to maximize throughput and meet latency SLOs, particularly in demanding, overloaded scenarios.
How It Works
Mooncake separates prefill and decoding into distinct clusters and uses underutilized CPU, DRAM, and SSD resources to build a disaggregated KVCache. Its core innovation is a KVCache-centric scheduler that maximizes throughput while still meeting latency SLOs, paired with a prediction-based early rejection policy that handles overloaded conditions.
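The scheduling idea can be illustrated with a toy sketch. This is not Mooncake's implementation: the class, function names, the linear prefill cost model, and the load threshold are all hypothetical, chosen only to show how cache reuse, queueing delay, and early rejection might interact.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PrefillInstance:
    """Hypothetical view of one prefill node as seen by the scheduler."""
    name: str
    queue_time_s: float        # estimated wait before the node can start
    cached_prefix_tokens: int  # prompt tokens already present in its KVCache


def estimate_ttft(inst: PrefillInstance, prompt_tokens: int,
                  prefill_s_per_token: float) -> float:
    """Assumed cost model: queueing delay plus linear cost of uncached tokens."""
    uncached = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return inst.queue_time_s + uncached * prefill_s_per_token


def schedule(instances: Sequence[PrefillInstance], prompt_tokens: int,
             ttft_slo_s: float, predicted_decode_load: float,
             decode_load_limit: float = 1.0,
             prefill_s_per_token: float = 0.001) -> Optional[PrefillInstance]:
    """Return the chosen prefill instance, or None if the request is rejected."""
    # Early rejection: if the decoding side is predicted to be overloaded,
    # refuse the request before spending any prefill compute on it.
    if predicted_decode_load > decode_load_limit or not instances:
        return None
    # KVCache-aware placement: a node with a longer cached prefix can win
    # even when its queue is longer, because it has less to recompute.
    best = min(instances,
               key=lambda i: estimate_ttft(i, prompt_tokens, prefill_s_per_token))
    if estimate_ttft(best, prompt_tokens, prefill_s_per_token) > ttft_slo_s:
        return None  # no instance can meet the TTFT SLO
    return best


nodes = [
    PrefillInstance("idle-cold", queue_time_s=0.5, cached_prefix_tokens=0),
    PrefillInstance("busy-warm", queue_time_s=1.0, cached_prefix_tokens=900),
]
chosen = schedule(nodes, prompt_tokens=1000, ttft_slo_s=2.0,
                  predicted_decode_load=0.6)
print(chosen.name if chosen else "rejected")
```

In this example the busier node is selected because reusing 900 cached tokens outweighs its longer queue. A real scheduler would also account for KVCache transfer cost and decoding-side placement, which this sketch leaves out.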
Quick Start & Requirements
- Install: pip install mooncake-transfer-engine
- Prerequisites: RDMA Driver & SDK (e.g., Mellanox OFED), Python 3.10+, CUDA 12.1+ (with GPUDirect Storage support if building with -DUSE_CUDA).
- Build Dependencies: GCC 9.4+, CMake 3.16+, Go 1.20+ (for P2P Store/etcd), Rust (optional), hiredis (optional), curl (optional).
- Setup: RDMA is recommended for optimal performance. Docker deployment is supported.
- Docs: Transfer Engine Guide, P2P Store Guide, Mooncake Store Guide, vLLM Integration Guide v0.2
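After installing, a quick sanity check of the Python-side prerequisites can be useful. The sketch below uses only the standard library and looks up the distribution name given in the install command; the helper name check_environment is hypothetical, and the check does not cover RDMA drivers or CUDA.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)                   # from the prerequisites list
PACKAGE = "mooncake-transfer-engine"   # distribution name used with pip


def check_environment(package: str = PACKAGE, min_python: tuple = MIN_PYTHON) -> dict:
    """Report whether the interpreter is new enough and the package is installed."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "package_version": version,
    }


if __name__ == "__main__":
    print(check_environment())
```

A package_version of None means pip has not installed the package into the active environment.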
Highlighted Details
- Achieves up to 525% throughput increase in simulated long-context scenarios compared to baselines.
- Transfer Engine offers high-performance data transfer over TCP, RDMA, and NVMe-oF, with transfers up to 4.6x faster than TCP on high-speed networks.
- vLLM integration with Transfer Engine shows up to 25% lower Mean TTFT than TCP-based transports.
- P2P Store enables efficient, decentralized sharing of temporary objects like checkpoints.
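Percentage figures like these are easy to misread, so a small conversion sketch may help. It assumes the usual reading in which an increase or reduction is measured relative to the baseline; the function names are illustrative only.

```python
def increase_to_multiplier(percent_increase: float) -> float:
    """A P% increase over a baseline means (1 + P/100) times the baseline."""
    return 1 + percent_increase / 100


def reduction_to_fraction(percent_reduction: float) -> float:
    """A P% reduction leaves (1 - P/100) of the baseline."""
    return 1 - percent_reduction / 100


print(increase_to_multiplier(525))  # 6.25
print(reduction_to_fraction(25))    # 0.75
```

Under that reading, a 525% throughput increase corresponds to up to 6.25x the baseline throughput (not 5.25x), and a 25% lower mean TTFT corresponds to 0.75x the TCP-based figure.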
Maintenance & Community
- Open-sourced components include Transfer Engine and Mooncake Store.
- Integrations with vLLM and SGLang are available.
- Awarded Best Paper at FAST 2025.
- An open-source trace is available.
Licensing & Compatibility
- The repository does not explicitly state a license for the core project or its components. The README mentions open-sourcing components but lacks a LICENSE file.
Limitations & Caveats
- Optimal performance is heavily dependent on RDMA network setup.
- The project's licensing status is unclear, which may impact commercial adoption.
- Some features, like the new vLLM integration with Mooncake Store, are noted as "coming soon."