production-stack by vllm-project

Reference stack for production vLLM deployment on Kubernetes

created 6 months ago
1,584 stars

Top 26.9% on sourcepulse

Project Summary

This project provides a Kubernetes-native reference implementation for deploying vLLM, targeting users who need to scale inference workloads from a single instance to a distributed deployment. It offers enhanced observability through a web dashboard and performance gains via intelligent request routing and KV cache offloading.

How It Works

The stack is deployed with Helm and comprises a vLLM serving engine, a request router, and an observability stack (Prometheus + Grafana). The router directs requests to the appropriate vLLM backends, maximizing KV cache reuse, and supports multiple routing strategies. The observability stack tracks key metrics such as request latency, time-to-first-token (TTFT), and KV cache utilization, surfacing them on a Grafana dashboard.
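
Once the chart is installed, the router and the Grafana dashboard are ordinary Kubernetes services, so one quick way to reach them locally is kubectl port-forward. A minimal sketch, assuming service names similar to what the chart typically creates (verify with kubectl get svc, since the actual names depend on your release name and namespace):

  # Expose the request router's OpenAI-compatible endpoint on localhost:30080.
  # "vllm-router-service" is an assumed service name; confirm it with
  # kubectl get svc in the release namespace.
  kubectl port-forward svc/vllm-router-service 30080:80

  # In another terminal, expose Grafana to browse the dashboard. The service
  # name and namespace depend on how the observability stack was installed.
  kubectl -n monitoring port-forward svc/kube-prom-stack-grafana 3000:80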

Quick Start & Requirements

  • Install: Clone the repo, then run helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml (see the sketch after this list).
  • Prerequisites: A running Kubernetes (K8s) environment with Helm installed.
  • Resources: A K8s cluster with GPU nodes for the serving engines.
  • Docs: Official documentation
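
A slightly fuller version of the install step, as a sketch: the Helm repository URL is assumed from the project's GitHub Pages and should be checked against the official documentation, and the values file path assumes the command is run from the cloned repository root.

  # Clone the repository so the example values files are available locally.
  git clone https://github.com/vllm-project/production-stack.git
  cd production-stack

  # Add the Helm repository (URL assumed; verify against the official docs).
  helm repo add vllm https://vllm-project.github.io/production-stack
  helm repo update

  # Deploy the minimal example: one serving engine plus the request router.
  helm install vllm vllm/vllm-stack \
    -f tutorials/assets/values-01-minimal-example.yaml

  # Watch the pods until the router and serving engine reach Running.
  kubectl get pods -w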

Highlighted Details

  • Provides an OpenAI-compatible API for deployed models (see the request example after this list).
  • Supports automatic service discovery and fault tolerance via Kubernetes.
  • Offers routing based on session IDs and prefix awareness (WIP).
  • Grafana dashboard monitors instance health, request latency, TTFT, and GPU KV cache usage.
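
As a sketch of the OpenAI-compatible interface, assuming the router has been port-forwarded to localhost:30080 as in the earlier example and that the deployment serves facebook/opt-125m (substitute whatever model your values file actually deploys):

  # List the models the router currently exposes.
  curl http://localhost:30080/v1/models

  # Send a completion request; swap the model name for the one you deployed.
  curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "facebook/opt-125m",
          "prompt": "Kubernetes is",
          "max_tokens": 32
        }'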

Maintenance & Community

  • Weekly community meetings are held.
  • Active development with a Q1 2025 roadmap including autoscaling and router improvements.
  • Community support is available via a Slack channel.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is under active development; features such as session-ID-based routing and a higher-performance router implementation in a language other than Python are noted as work in progress (WIP).

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 50
  • Issues (30d): 17
Star History
454 stars in the last 90 days

Explore Similar Projects

Starred by Eugene Yan (AI Scientist at AWS), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

seldon-core by SeldonIO

MLOps framework for production model deployment on Kubernetes
Top 0.1% · 5k stars
created 7 years ago
updated 1 day ago