llumnix by AlibabaPAI

Request scheduling layer for multi-instance LLM serving (research paper)

created 1 year ago
454 stars

Top 67.5% on sourcepulse

View on GitHub
Project Summary

Llumnix is a cross-instance request scheduling layer designed to optimize multi-instance Large Language Model (LLM) serving. It targets users of LLM inference engines such as vLLM, aiming to reduce time-to-first-token (TTFT) and time-between-tokens (TBT) latency and to increase throughput through advanced scheduling techniques.
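For context, both latency metrics are derived from per-token arrival times of a streamed response. The snippet below is a minimal, engine-agnostic sketch of how TTFT and TBT (and their P99 tails) are typically measured on the client side; the function names and measurements are illustrative assumptions, not part of Llumnix's API.

```python
import statistics

def request_latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and TBT for a single streamed request.

    request_start: wall-clock time at which the request was sent.
    token_times:   wall-clock arrival time of each generated token.
    (Both are hypothetical client-side measurements.)
    """
    ttft = token_times[0] - request_start                          # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]   # inter-token gaps
    tbt = statistics.mean(gaps) if gaps else 0.0                   # mean time between tokens
    return {"ttft": ttft, "tbt": tbt}

def p99(values: list[float]) -> float:
    """Tail latency: the 99th percentile across many requests."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
```

Scheduling quality shows up most clearly in the P99 tails of these distributions, which is why the benchmark figures below report both mean and P99 values.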

How It Works

Llumnix operates by dynamically scheduling requests across multiple LLM inference engine instances. Its core innovation lies in a KV cache migration mechanism with near-zero overhead, enabling continuous load balancing, memory de-fragmentation, and prefill-decode disaggregation. This fine-grained, KV-cache-aware scheduling allows for more efficient resource utilization and reduced queuing delays compared to simpler scheduling methods.
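To make "KV-cache-aware scheduling" concrete, here is a deliberately simplified dispatch policy that routes each new request to the instance with the most free KV-cache blocks. All names (InstanceState, dispatch, the block counts) are hypothetical illustrations, not Llumnix's actual scheduler or API; the real system additionally migrates running requests between instances and disaggregates prefill from decode.

```python
from dataclasses import dataclass

@dataclass
class InstanceState:
    """Hypothetical per-instance snapshot reported to the scheduler."""
    name: str
    free_kv_blocks: int      # unused KV-cache blocks on this instance
    queued_requests: int     # requests waiting in the instance's local queue

def dispatch(request_blocks_needed: int, instances: list[InstanceState]) -> InstanceState:
    """Pick the instance most likely to serve the request without queuing.

    A real cross-instance scheduler would also migrate running requests'
    KV caches between instances to keep this balance over time; here we
    only choose a target at admission.
    """
    candidates = [i for i in instances if i.free_kv_blocks >= request_blocks_needed]
    pool = candidates or instances  # fall back to the least-loaded instance if none fit
    return max(pool, key=lambda i: (i.free_kv_blocks, -i.queued_requests))

# Example: three instances with different KV-cache headroom.
fleet = [
    InstanceState("vllm-0", free_kv_blocks=120, queued_requests=2),
    InstanceState("vllm-1", free_kv_blocks=480, queued_requests=0),
    InstanceState("vllm-2", free_kv_blocks=60, queued_requests=5),
]
print(dispatch(request_blocks_needed=200, instances=fleet).name)  # vllm-1
```

The contrast with round-robin is that placement reacts to each instance's memory headroom and queue depth instead of ignoring them.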

Quick Start & Requirements

Highlighted Details

  • Outperforms round-robin scheduling by up to 6.4x in mean TTFT and 12.1x in P99 TTFT.
  • Achieves 12% P99 TBT improvement over round-robin.
  • Reduces average preemption stalls by two orders of magnitude.
  • Supports prefill-decode disaggregation and KV-cache migration.

Maintenance & Community

Llumnix is an alpha-stage project; its roadmap includes architectural improvements, policy optimization, and new features. The project is developed under Alibaba's AlibabaPAI organization.

Licensing & Compatibility

Licensed under the Apache 2.0 License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Llumnix is still in alpha: scalability and efficiency improvements, along with several planned features, remain works in progress, and the roadmap points to substantial further engineering and testing.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 40
  • Issues (30d): 2
  • Star History: 58 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

LightLLM by ModelTC

0.8% · 3k stars
Python framework for LLM inference and serving
created 2 years ago · updated 10 hours ago
Starred by Lewis Tunstall (Researcher at Hugging Face), Robert Nishihara (Cofounder of Anyscale; Author of Ray), and 4 more.

verl by volcengine

2.3% · 12k stars
RL training library for LLMs
created 9 months ago · updated 5 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

1.0% · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago · updated 5 hours ago