sarathi-serve by microsoft

LLM serving engine for low-latency & high-throughput inference (OSDI'24 paper)

Created 1 year ago
417 stars

Top 70.4% on SourcePulse

View on GitHub
Project Summary

Sarathi-Serve is a research-oriented LLM serving framework designed for low-latency and high-throughput inference. It targets researchers and engineers who need efficient large language model deployment, and it builds on the foundation of the vLLM project.

How It Works

Sarathi-Serve targets both low latency and high throughput through a specialized serving engine. Specific architectural details are deferred to its OSDI'24 paper, but its origin as a fork of vLLM suggests an underlying PagedAttention mechanism for efficient KV-cache memory management, a common approach to optimizing LLM inference.

Quick Start & Requirements

  • Install: pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
  • Prerequisites: CUDA 12.3, H100/A100 GPUs, Python 3.10.
  • Setup: Requires cloning the repository and creating a mamba environment; a consolidated command sketch is shown below the list.
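
A minimal end-to-end setup might look like the following sketch. The repository URL and environment name are assumptions for illustration; only the pip command and the Python 3.10 / CUDA prerequisites come from the README.

  # Clone the repository (URL assumed from the GitHub project page)
  git clone https://github.com/microsoft/sarathi-serve.git
  cd sarathi-serve

  # Create and activate a Python 3.10 environment with mamba (environment name is illustrative)
  mamba create -n sarathi python=3.10 -y
  mamba activate sarathi

  # Editable install, pulling FlashInfer wheels from the index given in the README
  pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/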

Highlighted Details

  • Optimized for low-latency and high-throughput LLM serving.
  • Based on the vLLM project, retaining critical features for research iteration.
  • Tested with CUDA 12.3 on H100 and A100 GPUs.

Maintenance & Community

This project is a research prototype from Microsoft. Further community engagement details are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. As a fork of vLLM (Apache 2.0), it may inherit similar licensing, but this requires verification. Suitability for commercial use is not specified.

Limitations & Caveats

Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. It has been pared down for faster research iteration, so it may lag more mature frameworks in production-readiness and feature completeness.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2%
2k stars
System for scalable LoRA adapter serving
Created 1 year ago
Updated 1 year ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Johannes Hagemann (Cofounder of Prime Intellect), and 3 more.

minions by HazyResearch

1.3%
1k stars
Communication protocol for cost-efficient LLM collaboration
Created 7 months ago
Updated 18 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4%
8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 1 week ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

3.5%
5k stars
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 15 hours ago