Disaggregated serving system for LLMs
DistServe is a disaggregated LLM serving system designed to improve throughput by separating prefill and decoding computation. It targets researchers and engineers who need to optimize large language model inference performance, offering automatic KV-Cache management and flexible parallelism configuration.
How It Works
DistServe decouples the prefill and decoding phases of LLM inference, which are typically colocated and batched together in existing systems. This disaggregation reduces interference between the two phases and allows for independent resource allocation and parallelism strategies. It leverages the SwiftTransformer C++ library as its backend, which supports advanced features like FlashAttention, Continuous Batching, and PagedAttention, enabling efficient KV-Cache handling and memory management.
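To make the idea concrete, here is a minimal sketch of the disaggregation pattern, not DistServe's actual API: `model.prefill`, `model.decode_step`, and `model.eos_token_id` are hypothetical names. Prefill and decode run as separate workers that hand off only the KV cache:

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt_tokens: list[int]
    kv_cache: object = None                 # produced by prefill, consumed by decode
    generated: list[int] = field(default_factory=list)

def prefill_worker(in_q: Queue, handoff_q: Queue, model) -> None:
    """Compute-bound phase: one forward pass over the whole prompt."""
    while True:
        req = in_q.get()
        req.kv_cache = model.prefill(req.prompt_tokens)   # hypothetical API
        handoff_q.put(req)                                # ship the KV cache to the decode side

def decode_worker(handoff_q: Queue, model, max_new_tokens: int = 128) -> None:
    """Memory-bound phase: extend each sequence one token at a time."""
    while True:
        req = handoff_q.get()
        for _ in range(max_new_tokens):
            token = model.decode_step(req.kv_cache)       # hypothetical API; reuses cached states
            req.generated.append(token)
            if token == model.eos_token_id:
                break
```

Because the two workers are separate, each side can be assigned its own GPU pool, batch size, and parallelism degree, which is the independent resource allocation described above.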
Quick Start & Requirements
Create the conda environment (conda env create -f environment.yml), activate it (conda activate distserve), clone and build SwiftTransformer (git clone ... && cd SwiftTransformer && cmake ...), and install DistServe (pip install -e .). Use nproc for parallel compilation; the full sequence is sketched below.
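Put together, the steps look roughly like this. The README elides the SwiftTransformer repository URL and the exact cmake invocation, so those parts are placeholders/assumptions:

```bash
# Create and activate the conda environment
conda env create -f environment.yml
conda activate distserve

# Clone and build the SwiftTransformer backend
git clone <SwiftTransformer-repo-url>   # URL elided in the README
cd SwiftTransformer
cmake -B build                          # configure; exact flags elided in the README
cmake --build build -j $(nproc)         # nproc: compile on all available cores
cd ..

# Install DistServe itself in editable mode
pip install -e .
```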
Highlighted Details
Maintenance & Community
The project appears to be associated with the authors of the cited paper "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving". No specific community channels or active development signals are present in the README.
Licensing & Compatibility
The README does not explicitly state a license. The project's dependencies, particularly SwiftTransformer, may have their own licenses that need to be reviewed for commercial use or closed-source integration.
Limitations & Caveats
The system requires a minimum of two GPUs for operation. The README does not detail specific performance benchmarks or provide information on supported operating systems beyond what is implied by the build process.