Mooncake by kvcache-ai

Research paper on a disaggregated architecture for LLM serving

created 1 year ago
3,675 stars

Top 13.5% on sourcepulse

Project Summary

Mooncake is an LLM serving platform designed for efficient long-context inference by disaggregating prefill and decoding clusters and leveraging a KVCache-centric architecture. It targets LLM service providers and researchers seeking to maximize throughput and meet latency SLOs, particularly in demanding, overloaded scenarios.

How It Works

Mooncake employs a disaggregated architecture that separates prefill and decoding clusters and repurposes underutilized CPU, DRAM, and SSD resources as a disaggregated KVCache. Its core innovation is a KVCache-centric scheduler that balances throughput maximization against latency SLOs, paired with a prediction-based early rejection policy that sheds requests up front under overload.
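
The early rejection idea can be illustrated with a toy sketch (class names, constants, and the load model here are hypothetical, not Mooncake's actual scheduler): estimate the time-to-first-token (TTFT) a new request would see given the work already queued, and reject it immediately if the estimate would violate the SLO, rather than queueing doomed work.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int


class EarlyRejectScheduler:
    """Toy prediction-based early rejection: admit a request only if its
    predicted TTFT stays within the SLO. All numbers are illustrative,
    not taken from Mooncake."""

    def __init__(self, prefill_tokens_per_s: float, ttft_slo_s: float):
        self.prefill_tokens_per_s = prefill_tokens_per_s
        self.ttft_slo_s = ttft_slo_s
        self.queued_tokens = 0  # prompt tokens already waiting for prefill

    def predict_ttft(self, req: Request) -> float:
        # Queued work must drain before this request's prefill completes.
        return (self.queued_tokens + req.prompt_tokens) / self.prefill_tokens_per_s

    def try_admit(self, req: Request) -> bool:
        if self.predict_ttft(req) > self.ttft_slo_s:
            return False  # reject early instead of missing the SLO later
        self.queued_tokens += req.prompt_tokens
        return True


sched = EarlyRejectScheduler(prefill_tokens_per_s=10_000, ttft_slo_s=2.0)
print(sched.try_admit(Request(prompt_tokens=8_000)))   # predicted 0.8 s: admit
print(sched.try_admit(Request(prompt_tokens=15_000)))  # predicted 2.3 s: reject
```

A real scheduler would also account for decoding-cluster load and KVCache reuse; this sketch captures only the admit-or-reject decision.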

Quick Start & Requirements

  • Install: pip install mooncake-transfer-engine
  • Prerequisites: RDMA driver & SDK (e.g., Mellanox OFED), Python 3.10+, CUDA 12.1+ (with GPUDirect Storage support if building with -DUSE_CUDA).
  • Build Dependencies: GCC 9.4+, CMake 3.16+, Go 1.20+ (for P2P Store/etcd), Rust (optional), hiredis (optional), curl (optional).
  • Setup: RDMA is recommended for optimal performance; Docker deployment is supported.
  • Docs: Transfer Engine Guide, P2P Store Guide, Mooncake Store Guide, vLLM Integration Guide v0.2
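
A minimal setup fragment under the prerequisites above (assumes a host with RDMA drivers already installed; the `ibv_devices`/`ibstat` checks are standard OFED utilities, not Mooncake tooling):

```shell
# Install the prebuilt Transfer Engine wheel (Python 3.10+ required)
pip install mooncake-transfer-engine

# Sanity-check the RDMA environment before benchmarking
ibv_devices   # list RDMA-capable NICs
ibstat        # verify the link state is Active
```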

Highlighted Details

  • Achieves up to 525% throughput increase in simulated long-context scenarios compared to baselines.
  • Transfer Engine offers high-performance data transfer over TCP, RDMA, and NVMe-oF, with up to 4.6x faster bandwidth than TCP on high-speed networks.
  • vLLM integration with Transfer Engine shows up to 25% lower Mean TTFT than TCP-based transports.
  • P2P Store enables efficient, decentralized sharing of temporary objects like checkpoints.
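
The P2P Store's decentralized sharing can be sketched in miniature (the class and method names are hypothetical, not the actual P2P Store API): a metadata registry maps each object key to the peers holding a replica, and a fetching peer pulls from any holder and then registers itself as an additional source, so popular objects like checkpoints spread without a central file server.

```python
class P2PStoreSketch:
    """Toy decentralized object sharing: a registry maps each key to the
    set of peers holding a replica; fetchers become replicas themselves.
    Purely illustrative, not the P2P Store API."""

    def __init__(self):
        self.registry: dict[str, set[str]] = {}           # key -> holders
        self.peer_data: dict[str, dict[str, bytes]] = {}  # peer -> objects

    def put(self, peer: str, key: str, blob: bytes) -> None:
        self.peer_data.setdefault(peer, {})[key] = blob
        self.registry.setdefault(key, set()).add(peer)

    def get(self, peer: str, key: str) -> bytes:
        holders = self.registry.get(key)
        if not holders:
            raise KeyError(key)
        source = next(iter(holders))        # any holder will do
        blob = self.peer_data[source][key]  # "transfer" from that peer
        self.put(peer, key, blob)           # fetcher now serves it too
        return blob


store = P2PStoreSketch()
store.put("node-a", "ckpt-0001", b"checkpoint bytes")
store.get("node-b", "ckpt-0001")            # node-b pulls and re-shares
print(sorted(store.registry["ckpt-0001"]))  # ['node-a', 'node-b']
```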

Maintenance & Community

  • Open-sourced components include Transfer Engine and Mooncake Store.
  • Integrations with vLLM and SGLang are available.
  • Awarded Best Paper at FAST 2025.
  • Open Source Trace available.

Licensing & Compatibility

  • The repository does not state a license for the core project or its components: the README mentions open-sourced components, but no LICENSE file is present.

Limitations & Caveats

  • Optimal performance is heavily dependent on RDMA network setup.
  • The project's licensing status is unclear, which may impact commercial adoption.
  • Some features, like the new vLLM integration with Mooncake Store, are noted as "coming soon."
Health Check

  • Last commit: 15 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 98
  • Issues (30d): 30
  • Star history: 529 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Top 1.1% · 5k stars
Inference framework for distributed generative AI model serving
created 5 months ago · updated 17 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

Top 1.0% · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago · updated 14 hours ago