Reference stack for production vLLM deployment on Kubernetes
This project provides a reference implementation for deploying vLLM natively on a Kubernetes cluster, targeting users who need to scale inference workloads from a single instance to a distributed deployment. It offers enhanced observability through a web dashboard and improved performance via intelligent request routing and KV cache offloading.
How It Works
The stack is deployed via Helm and comprises a vLLM serving engine, a request router, and an observability stack (Prometheus + Grafana). The router directs requests to the appropriate vLLM backends to maximize KV cache reuse and supports several routing strategies. The observability stack monitors key metrics such as request latency, time-to-first-token (TTFT), and KV cache utilization, surfacing them in a Grafana dashboard. A sketch of a minimal values file follows below.
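For orientation, a minimal Helm values file for this stack typically declares which model each serving-engine replica loads and which routing strategy the router uses. The field names below are illustrative assumptions modeled on the chart's bundled example values (such as tutorials/assets/values-01-minimal-example.yaml), not a verified schema:

  # values-sketch.yaml -- illustrative only; keys are assumptions, check the chart's values.yaml
  servingEngineSpec:
    modelSpec:
      - name: "llama3"                              # label for this backend group
        repository: "vllm/vllm-openai"              # serving-engine container image (assumed)
        tag: "latest"
        modelURL: "meta-llama/Llama-3.1-8B-Instruct" # model to serve
        replicaCount: 1
        requestGPU: 1                               # GPUs requested per replica

  routerSpec:
    routingLogic: "roundrobin"                      # routing strategy; session-based routing is WIP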
Quick Start & Requirements
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
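The install command above assumes the chart repository has already been added; once the release is up, the deployment can be smoke-tested by port-forwarding the router service and querying its OpenAI-compatible API. The repository URL, service name, and port below are assumptions based on a default vllm-stack release and may differ in your cluster:

  # Add the chart repository (URL assumed; see the project README for the canonical source)
  helm repo add vllm https://vllm-project.github.io/production-stack
  helm repo update

  # Smoke-test after install; service name and port are assumptions for a default release
  kubectl port-forward svc/vllm-router-service 30080:80
  curl http://localhost:30080/v1/models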
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is under active development; features such as session-ID based routing and more performant router implementations in languages other than Python are noted as work in progress (WIP).