DistServe by LLMServe

Disaggregated serving system for LLMs

created 1 year ago
651 stars

Top 52.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

DistServe is a disaggregated LLM serving system designed to improve throughput by separating prefill and decoding computations. It targets researchers and engineers working with large language models who need to optimize inference performance, offering automatic KV-Cache management and flexible parallelism configuration.

How It Works

DistServe decouples the prefill and decoding phases of LLM inference, which are typically colocated and batched together in existing systems. This disaggregation reduces interference between the two phases and allows for independent resource allocation and parallelism strategies. It leverages the SwiftTransformer C++ library as its backend, which supports advanced features like FlashAttention, Continuous Batching, and PagedAttention, enabling efficient KV-Cache handling and memory management.
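The disaggregation described above can be illustrated with a minimal sketch (this is not DistServe's actual API; the `Request`, `prefill_worker`, and `decode_worker` names are hypothetical). Prefill workers build the KV cache for a whole prompt, then hand the request off to decode workers that generate tokens one at a time, so each phase can be batched, scheduled, and scaled independently:

```python
# Illustrative sketch of disaggregated prefill/decode serving.
# Strings stand in for tokens and KV-cache tensors.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: list[str]          # tokenized prompt
    max_new_tokens: int
    kv_cache: list[str] = field(default_factory=list)  # stand-in for KV tensors
    output: list[str] = field(default_factory=list)

def prefill_worker(inbox: Queue, decode_inbox: Queue) -> None:
    """Process full prompts; emit requests with a populated KV cache."""
    while not inbox.empty():
        req = inbox.get()
        req.kv_cache = list(req.prompt)  # "compute" one KV entry per prompt token
        decode_inbox.put(req)            # hand the KV cache to the decode side

def decode_worker(inbox: Queue, finished: list) -> None:
    """Generate tokens one step at a time, extending the KV cache each step."""
    while not inbox.empty():
        req = inbox.get()
        for i in range(req.max_new_tokens):
            tok = f"tok{i}"              # placeholder for one sampled token
            req.output.append(tok)
            req.kv_cache.append(tok)     # decode grows the cache incrementally
        finished.append(req)

prefill_q, decode_q, finished = Queue(), Queue(), []
prefill_q.put(Request(prompt=["Hello", "world"], max_new_tokens=3))
prefill_worker(prefill_q, decode_q)   # compute-bound phase, batched by prompt
decode_worker(decode_q, finished)     # memory-bound phase, batched by step
print(finished[0].output)             # ['tok0', 'tok1', 'tok2']
```

Because the two worker pools are separate, each can use its own batch size and parallelism strategy, which is the interference reduction the paper targets.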

Quick Start & Requirements

  • Install: Clone the repository, create and activate the Conda environment (conda env create -f environment.yml && conda activate distserve), build the SwiftTransformer backend (git clone ... && cd SwiftTransformer && cmake ...), then install DistServe (pip install -e .).
  • Prerequisites: Requires at least two GPUs.
  • Resources: Building SwiftTransformer requires CMake; compilation can be parallelized (e.g., make -j$(nproc)).
  • Links: SwiftTransformer

Highlighted Details

  • Supports GPT-2, OPT, and LLaMA2 model families.
  • Utilizes SwiftTransformer backend with FlashAttention and PagedAttention.
  • Enables independent parallelism and scheduling for prefill and decoding.
  • Offers automatic KV-Cache communication and memory management.

Maintenance & Community

The project appears to be associated with the authors of the cited paper "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving". No specific community channels or active development signals are present in the README.

Licensing & Compatibility

The README does not explicitly state a license. The project's dependencies, particularly SwiftTransformer, may have their own licenses that need to be reviewed for commercial use or closed-source integration.

Limitations & Caveats

The system requires a minimum of two GPUs for operation. The README does not detail specific performance benchmarks or provide information on supported operating systems beyond what is implied by the build process.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 82 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Top 1.1% · 5k stars
Inference framework for distributed generative AI model serving
created 5 months ago · updated 1 day ago