llumnix by AlibabaPAI

Request scheduling layer for multi-instance LLM serving (research paper)

Created 1 year ago
504 stars

Top 61.7% on SourcePulse

Project Summary

Llumnix is a cross-instance request scheduling layer designed to optimize multi-instance Large Language Model (LLM) serving. It targets users of LLM inference engines such as vLLM, aiming to reduce time-to-first-token (TTFT) and time-between-tokens (TBT) latency and to increase throughput through advanced scheduling techniques.

How It Works

Llumnix operates by dynamically scheduling requests across multiple LLM inference engine instances. Its core innovation lies in a KV cache migration mechanism with near-zero overhead, enabling continuous load balancing, memory de-fragmentation, and prefill-decode disaggregation. This fine-grained, KV-cache-aware scheduling allows for more efficient resource utilization and reduced queuing delays compared to simpler scheduling methods.
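To make the idea concrete, here is a minimal toy sketch of KV-cache-aware dispatch and load rebalancing across instances. This is not the Llumnix API: the class names, the load metric, and the thresholds are all invented for illustration, and real Llumnix migrates live KV cache between engine instances rather than just moving bookkeeping entries.

```python
# Hypothetical sketch of KV-cache-aware scheduling; all names and
# thresholds are invented, not the Llumnix API.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    kv_blocks_total: int
    kv_blocks_used: int = 0
    queue: list = field(default_factory=list)  # KV-block sizes of hosted requests

    @property
    def load(self) -> float:
        # Fraction of KV-cache blocks in use; a real scheduler would also
        # weigh queue depth and in-flight decode work.
        return self.kv_blocks_used / self.kv_blocks_total

def dispatch(instances, request_blocks: int) -> Instance:
    # Send the request to the least-loaded instance whose free KV-cache
    # space can hold the request (reduces queuing and fragmentation).
    candidates = [i for i in instances
                  if i.kv_blocks_total - i.kv_blocks_used >= request_blocks]
    target = min(candidates, key=lambda i: i.load)
    target.kv_blocks_used += request_blocks
    target.queue.append(request_blocks)
    return target

def rebalance(instances, high=0.9, low=0.5):
    # Migrate the smallest request from the most-loaded instance to the
    # least-loaded one when their loads diverge; Llumnix performs this
    # kind of migration continuously with near-zero overhead.
    src = max(instances, key=lambda i: i.load)
    dst = min(instances, key=lambda i: i.load)
    if src.load > high and dst.load < low and src.queue:
        blocks = min(src.queue)
        if dst.kv_blocks_total - dst.kv_blocks_used < blocks:
            return None  # destination cannot hold the migrated KV cache
        src.queue.remove(blocks)
        src.kv_blocks_used -= blocks
        dst.kv_blocks_used += blocks
        dst.queue.append(blocks)
        return (src.name, dst.name, blocks)
    return None
```

The same load signal could drive prefill-decode disaggregation by dispatching prefill and decode phases to separate instance pools; this sketch only covers the load-balancing case.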

Quick Start & Requirements

Highlighted Details

  • Outperforms round-robin scheduling by up to 6.4x in mean TTFT and 12.1x in P99 TTFT.
  • Achieves 12% P99 TBT improvement over round-robin.
  • Reduces average preemption stalls by two orders of magnitude.
  • Supports prefill-decode disaggregation and KV-cache migration.

Maintenance & Community

Llumnix is an alpha-stage project with planned roadmap items including architectural improvements, policy optimization, and new features. The project is associated with Alibaba.

Licensing & Compatibility

Licensed under the Apache 2.0 License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Llumnix is still in alpha; the roadmap focuses on further engineering and testing to improve scalability and efficiency and to add new features.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 12 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

production-stack by vllm-project

0.8% · 2k stars
Reference stack for production vLLM deployment on Kubernetes
Created 9 months ago · Updated 4 days ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

2.7% · 6k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago · Updated 15 hours ago