Request scheduling layer for multi-instance LLM serving (research paper)
Llumnix is a cross-instance request scheduling layer designed to optimize multi-instance Large Language Model (LLM) serving. It targets users of LLM inference engines such as vLLM, aiming to reduce latency (time-to-first-token, TTFT; time-between-tokens, TBT) and increase throughput through advanced scheduling techniques.
How It Works
Llumnix operates by dynamically scheduling requests across multiple LLM inference engine instances. Its core innovation lies in a KV cache migration mechanism with near-zero overhead, enabling continuous load balancing, memory de-fragmentation, and prefill-decode disaggregation. This fine-grained, KV-cache-aware scheduling allows for more efficient resource utilization and reduced queuing delays compared to simpler scheduling methods.
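To make the scheduling idea concrete, below is a minimal, illustrative sketch of a KV-cache-aware dispatch-and-rebalance loop in Python. It is not Llumnix's actual implementation: the `Instance` class, the free-block accounting, the `dispatch`/`rebalance` functions, and the migration threshold are all hypothetical stand-ins for the real engine and scheduler APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """Hypothetical stand-in for one LLM engine instance; not a Llumnix class."""
    name: str
    total_kv_blocks: int
    # request id -> KV cache blocks currently held by that request
    requests: dict[str, int] = field(default_factory=dict)

    @property
    def free_kv_blocks(self) -> int:
        return self.total_kv_blocks - sum(self.requests.values())

def dispatch(instances: list[Instance], req_id: str, est_blocks: int) -> Instance:
    """KV-cache-aware dispatch: place a new request on the instance
    with the most free KV cache blocks."""
    target = max(instances, key=lambda i: i.free_kv_blocks)
    target.requests[req_id] = est_blocks
    return target

def rebalance(instances: list[Instance], threshold: int = 16) -> None:
    """One step of continuous load balancing: if the free-block gap between
    the most- and least-loaded instances exceeds the threshold, migrate one
    request off the loaded instance. In Llumnix the KV cache moves with the
    request at near-zero overhead; here only the bookkeeping moves."""
    src = min(instances, key=lambda i: i.free_kv_blocks)
    dst = max(instances, key=lambda i: i.free_kv_blocks)
    if dst.free_kv_blocks - src.free_kv_blocks > threshold and src.requests:
        req_id = min(src.requests, key=src.requests.get)  # cheapest to move
        dst.requests[req_id] = src.requests.pop(req_id)

if __name__ == "__main__":
    a, b = Instance("llm-0", 128), Instance("llm-1", 128)
    # KV usage grows during decode, so instances drift out of balance
    # even if dispatch was balanced at admission time.
    a.requests = {"req-0": 70, "req-1": 30}
    b.requests = {"req-2": 20}
    dispatch([a, b], "req-3", est_blocks=8)    # lands on llm-1 (more free blocks)
    rebalance([a, b])                          # migrates req-1 from llm-0 to llm-1
    print(a.free_kv_blocks, b.free_kv_blocks)  # 58 70
```

The point of migrating live requests rather than only balancing at admission time is that KV cache usage keeps growing during decode, so imbalance and fragmentation emerge even under a perfect initial placement.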
Quick Start & Requirements
Launch the vLLM-based API server with:

```bash
python -m llumnix.entrypoints.vllm.api_server ...
```

For Ray deployment, use the serve entrypoint.
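As a quick smoke test, one might send a generation request to the launched server. This sketch assumes a vLLM-style `/generate` endpoint on `localhost:8000`; the host, port, endpoint path, and payload fields are assumptions for illustration, so check the Llumnix documentation for the actual interface.

```python
import json
import urllib.request

# Assumed endpoint and payload, mirroring vLLM's legacy api_server.
payload = {"prompt": "What is Llumnix?", "max_tokens": 64}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```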
Maintenance & Community
Llumnix is an alpha-stage project; its roadmap includes architectural improvements, policy optimization, and new features. The project is developed at Alibaba.
Licensing & Compatibility
Licensed under the Apache 2.0 License, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
Llumnix is alpha software: interfaces may change, and the roadmap prioritizes further engineering and testing work on scalability, efficiency, and new features.