sarathi-serve by Microsoft

LLM serving engine for low-latency & high-throughput inference (OSDI'24 paper)

created 1 year ago
399 stars

Top 73.5% on sourcepulse

View on GitHub
Project Summary

Sarathi-Serve is a research-oriented LLM serving framework, built as a fork of vLLM, designed for low-latency and high-throughput inference. It targets researchers and engineers who need to deploy and experiment with large language models efficiently.

How It Works

Sarathi-Serve targets the throughput-latency tradeoff in LLM inference. Its OSDI'24 paper introduces chunked prefills, which split a long prompt's prefill phase into smaller pieces, and stall-free batching, which schedules those pieces alongside ongoing decode steps so that decodes are not blocked behind long prefills. As a fork of vLLM, it inherits PagedAttention for block-based KV-cache memory management.
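The toy Python sketch below illustrates the chunked-prefill scheduling idea only: the names, token budget, and first-come policy are assumptions for exposition, not Sarathi-Serve's actual scheduler or API.

    # Toy illustration of stall-free batching with chunked prefills.
    # All names, the budget, and the policy are illustrative; this is
    # not Sarathi-Serve's real implementation.
    from dataclasses import dataclass

    @dataclass
    class Request:
        rid: str
        prompt_len: int         # total prompt tokens to prefill
        prefilled: int = 0      # prompt tokens processed so far
        decoding: bool = False  # True once prefill has finished

    def schedule_iteration(requests, token_budget=512):
        """Assemble one batch: all decodes first, then prefill chunks."""
        batch, budget = [], token_budget
        # Decodes cost one token each and are admitted first, so they
        # are never stalled behind a long prompt's prefill.
        for r in requests:
            if r.decoding and budget > 0:
                batch.append((r.rid, "decode", 1))
                budget -= 1
        # The remaining budget is filled with chunks of pending prefills.
        for r in requests:
            if not r.decoding and budget > 0:
                chunk = min(r.prompt_len - r.prefilled, budget)
                batch.append((r.rid, "prefill", chunk))
                r.prefilled += chunk
                budget -= chunk
                if r.prefilled == r.prompt_len:
                    r.decoding = True  # switch to decode next iteration
        return batch

    # A 2000-token prompt is split across iterations instead of
    # monopolizing one batch, so other requests keep decoding.
    reqs = [Request("long", 2000), Request("short", 100)]
    for _ in range(5):
        print(schedule_iteration(reqs))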

Quick Start & Requirements

  • Install: pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
  • Prerequisites: CUDA 12.3, H100/A100 GPUs, Python 3.10.
  • Setup: Requires cloning the repository and creating a mamba environment; a consolidated sequence is sketched after this list.
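A consolidated setup might look like the following. The repository URL matches the GitHub project, but the environment name and Python pinning are illustrative; defer to the repository README for the authoritative steps.

    # Clone the repository and enter it
    git clone https://github.com/microsoft/sarathi-serve.git
    cd sarathi-serve

    # Create and activate an isolated environment (name is illustrative)
    mamba create -n sarathi-serve python=3.10 -y
    mamba activate sarathi-serve

    # Editable install, pulling FlashInfer wheels from the index in the README
    pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/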

Highlighted Details

  • Optimized for low-latency and high-throughput LLM serving.
  • Based on the vLLM project, retaining critical features for research iteration.
  • Tested with CUDA 12.3 on H100 and A100 GPUs.

Maintenance & Community

This project is a research prototype from Microsoft. Further community engagement details are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. As a fork of vLLM (Apache 2.0), it may inherit similar licensing, but this requires verification. Compatibility for commercial use is not specified.

Limitations & Caveats

Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. It has been adapted for faster research iterations, implying potential limitations in production-readiness or feature completeness compared to more mature frameworks.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 51 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

Explore Similar Projects

LightLLM by ModelTC: Python framework for LLM inference and serving. 3k stars, top 0.7% on sourcepulse; created 2 years ago, updated 20 hours ago.