sarathi-serve by microsoft

LLM serving engine for low-latency & high-throughput inference (OSDI'24 paper)

Created 1 year ago
417 stars

Top 70.4% on SourcePulse

View on GitHub
Project Summary

Sarathi-Serve is a research-oriented LLM serving framework designed for low-latency and high-throughput inference. It targets researchers and engineers who need efficient large language model deployment, and it builds on the foundation of the vLLM project.

How It Works

Sarathi-Serve targets both low latency and high throughput through a specialized serving engine. Specific architectural details are deferred to its OSDI'24 paper, but its origin as a fork of vLLM suggests an underlying PagedAttention mechanism for efficient KV-cache memory management, a common approach to optimizing LLM inference.

Quick Start & Requirements

  • Install: pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
  • Prerequisites: CUDA 12.3, H100/A100 GPUs, Python 3.10.
  • Setup: Requires cloning the repository and creating a mamba environment; a consolidated command sketch is shown below the list.
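
A minimal end-to-end setup might look like the following sketch. The repository URL and environment name are assumptions for illustration; only the pip command and the Python 3.10 / CUDA prerequisites come from the README.

  # Clone the repository (URL assumed from the GitHub project page)
  git clone https://github.com/microsoft/sarathi-serve.git
  cd sarathi-serve

  # Create and activate a Python 3.10 environment with mamba (environment name is illustrative)
  mamba create -n sarathi python=3.10 -y
  mamba activate sarathi

  # Editable install, pulling FlashInfer wheels from the index given in the README
  pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/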

Highlighted Details

  • Optimized for low-latency and high-throughput LLM serving.
  • Based on the vLLM project, retaining critical features for research iteration.
  • Tested with CUDA 12.3 on H100 and A100 GPUs.

Maintenance & Community

This project is a research prototype from Microsoft. Further community engagement details are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. As a fork of vLLM (Apache 2.0), it may inherit similar licensing, but this requires verification. Suitability for commercial use is not specified.

Limitations & Caveats

Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. It has been pared down for faster research iteration, so it may lag more mature frameworks in production-readiness and feature completeness.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2%
2k stars
System for scalable LoRA adapter serving
Created 1 year ago
Updated 1 year ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

minions by HazyResearch

1.3%
1k stars
Communication protocol for cost-efficient LLM collaboration
Created 7 months ago
Updated 18 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4%
8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 1 week ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

3.5%
5k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 15 hours ago