sarathi-serve by microsoft

LLM serving engine for low-latency & high-throughput inference (OSDI'24 paper)

Created 2 years ago · 476 stars · Top 64.3% on SourcePulse

Project Summary

Sarathi-Serve is a research-oriented LLM serving framework designed for low-latency and high-throughput inference. Built on the vLLM project, it targets researchers and engineers who need to deploy large language models efficiently.

How It Works

Sarathi-Serve targets both low latency and high throughput through a specialized serving engine. Specific architectural details are deferred to its OSDI'24 paper, but its origin as a fork of vLLM suggests an underlying PagedAttention mechanism for efficient memory management and KV-cache handling, a common approach for optimizing LLM inference; a toy block-table sketch follows.
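The sketch below is a minimal, hypothetical illustration of the paged KV-cache idea (block-granular allocation via a per-sequence block table). It is not sarathi-serve's or vLLM's actual code; the names BLOCK_SIZE, BlockAllocator, and Sequence are invented for the example.

```python
# Toy illustration of paged KV-cache bookkeeping, not the project's implementation.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, not from the repo)

@dataclass
class BlockAllocator:
    """Hands out physical KV-cache block ids from a fixed pool."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("out of KV-cache blocks")
        return self.free.pop()

@dataclass
class Sequence:
    """Maps a request's token positions to physical blocks via a block table."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is claimed only when the previous one fills up,
        # so KV-cache memory grows in BLOCK_SIZE steps instead of being
        # reserved up front for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):               # generate 40 tokens
    seq.append_token(allocator)
print(len(seq.block_table))       # 3 blocks cover 40 tokens at 16 tokens/block
```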

Quick Start & Requirements

  • Install: pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
  • Prerequisites: CUDA 12.3, H100/A100 GPUs, Python 3.10.
  • Setup: Clone the repository and create a mamba environment (a setup sketch follows this list).
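A minimal setup sketch, assuming the repository lives at github.com/microsoft/sarathi-serve and using a placeholder environment name (sarathi); consult the repository README for the authoritative steps.

```bash
# Clone the repository (URL assumed from the project's GitHub org)
git clone https://github.com/microsoft/sarathi-serve.git
cd sarathi-serve

# Create a Python 3.10 environment with mamba; the env name "sarathi" is a placeholder
mamba create -n sarathi python=3.10 -y
mamba activate sarathi

# Editable install, pulling FlashInfer wheels built for CUDA 12.1 / torch 2.3
pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
```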

Highlighted Details

  • Optimized for low-latency and high-throughput LLM serving.
  • Based on the vLLM project, retaining critical features for research iteration.
  • Tested with CUDA 12.3 on H100 and A100 GPUs.

Maintenance & Community

This project is a research prototype from Microsoft. Further community engagement details are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. As a fork of vLLM (Apache 2.0), it may inherit similar licensing, but this requires verification. Compatibility for commercial use is not specified.

Limitations & Caveats

Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. It has been adapted for faster research iteration, so it may lack the production-readiness and feature completeness of more mature serving frameworks.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0% · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 2 years ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.4% · 7k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago · Updated 18 hours ago