llumnix by AlibabaPAI

Request scheduling layer for multi-instance LLM serving (research paper)

Created 1 year ago
487 stars

Top 63.2% on SourcePulse

Project Summary

Llumnix is a cross-instance request scheduling layer designed to optimize multi-instance Large Language Model (LLM) serving. It targets users of LLM inference engines such as vLLM, aiming to reduce time-to-first-token (TTFT) and time-between-tokens (TBT) latency and to increase throughput through advanced scheduling techniques.

How It Works

Llumnix operates by dynamically scheduling requests across multiple LLM inference engine instances. Its core innovation lies in a KV cache migration mechanism with near-zero overhead, enabling continuous load balancing, memory de-fragmentation, and prefill-decode disaggregation. This fine-grained, KV-cache-aware scheduling allows for more efficient resource utilization and reduced queuing delays compared to simpler scheduling methods.
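To make the idea concrete, here is a minimal sketch of KV-cache-aware dispatch and load rebalancing across instances. This is not Llumnix's actual API; every class and function name is hypothetical, and the real system migrates live KV-cache state with near-zero overhead rather than the simple block-count bookkeeping modeled here.

```python
# Illustrative sketch only (hypothetical names, not the Llumnix API):
# dispatch new requests to the instance with the most free KV-cache blocks,
# and migrate a request when load between instances diverges too far.
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    kv_blocks_total: int
    kv_blocks_used: int = 0
    queue: list = field(default_factory=list)

    @property
    def load(self) -> float:
        # Fraction of KV-cache blocks in use: a proxy for memory pressure.
        return self.kv_blocks_used / self.kv_blocks_total

def dispatch(instances, request_id, kv_blocks_needed):
    # KV-cache-aware placement: pick the least-loaded instance instead of
    # round-robin, reducing queuing delays and preemption risk.
    target = min(instances, key=lambda i: i.load)
    target.queue.append(request_id)
    target.kv_blocks_used += kv_blocks_needed
    return target

def rebalance(instances, threshold=0.3):
    # If the busiest and freest instances diverge by more than `threshold`,
    # migrate one request (modeled here as moving its KV-block share).
    src = max(instances, key=lambda i: i.load)
    dst = min(instances, key=lambda i: i.load)
    if src.load - dst.load > threshold and src.queue:
        req = src.queue.pop()
        moved = src.kv_blocks_used // (len(src.queue) + 1)
        src.kv_blocks_used -= moved
        dst.kv_blocks_used += moved
        dst.queue.append(req)
        return req
    return None
```

The same bookkeeping generalizes to de-fragmentation (migrating to pack free blocks together) and prefill-decode disaggregation (migrating a request from a prefill instance to a decode instance once its KV cache is built).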

Quick Start & Requirements

Highlighted Details

  • Improves mean TTFT by up to 6.4x and P99 TTFT by up to 12.1x over round-robin scheduling.
  • Improves P99 TBT by 12% over round-robin.
  • Reduces average preemption stalls by two orders of magnitude.
  • Supports prefill-decode disaggregation and KV-cache migration.

Maintenance & Community

Llumnix is an alpha-stage project with planned roadmap items including architectural improvements, policy optimization, and new features. The project is associated with Alibaba.

Licensing & Compatibility

Licensed under the Apache 2.0 License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Llumnix is alpha-stage software, so interfaces and behavior may change; the roadmap prioritizes further engineering and testing around scalability, efficiency, and new features.

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 1
Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Philipp Schmid (DevRel at Google DeepMind).

production-stack by vllm-project

1.0%
2k
Reference stack for production vLLM deployment on Kubernetes
Created 8 months ago
Updated 2 days ago
Starred by Taranjeet Singh (cofounder of Mem0), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

3.5%
5k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 15 hours ago
Starred by Carol Willing (core contributor to CPython and Jupyter), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 15 hours ago