production-stack  by vllm-project

Reference stack for production vLLM deployment on Kubernetes

Created 11 months ago
2,097 stars

Top 21.1% on SourcePulse

Project Summary

This project provides a reference implementation for deploying vLLM on Kubernetes, aimed at users who need to scale inference workloads from a single instance to a distributed deployment. It adds observability through a web dashboard and improves performance through intelligent request routing and KV cache offloading.

How It Works

The stack is deployed with Helm and comprises three components: a vLLM serving engine, a request router, and an observability stack (Prometheus + Grafana). The router directs each request to an appropriate vLLM backend so as to maximize KV cache reuse, and it supports several routing strategies. The observability stack tracks key metrics such as request latency, time to first token (TTFT), and KV cache utilization, and exposes them in a Grafana dashboard.
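To make the idea of cache-friendly routing concrete, here is a minimal sketch of one possible strategy: pinning each session to the same backend so that its KV cache can be reused. It is not the project's router code. The class name SessionRouter, the backend URLs, and the choice of a consistent-hash ring are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (unlike the built-in hash(),
    this does not change between Python processes)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class SessionRouter:
    """Hypothetical session-sticky router built on a consistent-hash ring."""

    def __init__(self, backends, vnodes=64):
        if not backends:
            raise ValueError("at least one backend is required")
        # Place several virtual nodes per backend on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, session_id: str) -> str:
        # Pick the first ring point clockwise from the session's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]


# Backend URLs are placeholders, not addresses the stack actually uses.
backends = ["http://vllm-0:8000", "http://vllm-1:8000", "http://vllm-2:8000"]
router = SessionRouter(backends)
first = router.route("session-abc")
second = router.route("session-abc")  # same session, same backend
```

A hash ring is used here rather than a plain modulo over the backend list because removing or adding one backend then only remaps the sessions that were assigned to it, which keeps the other sessions on backends that already hold their cached prefixes.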

Quick Start & Requirements

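  • Install: Clone the repo, then from the repo root run helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml. The vllm/ prefix refers to a Helm chart repository named vllm, which must be registered with helm repo add beforehand; see the official documentation for its URL.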
  • Prerequisites: A running Kubernetes (K8s) cluster with GPU nodes.
  • Docs: Official documentation

Highlighted Details

  • Provides an OpenAI API interface for deployed models.
  • Supports automatic service discovery and fault tolerance via Kubernetes.
  • Offers routing based on session IDs and prefix awareness (WIP).
  • Grafana dashboard monitors instance health, latency, TTFT, and GPU KV usage.
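Because the stack exposes an OpenAI API interface, a deployed model can be queried like any OpenAI-style completion endpoint. The sketch below only builds the HTTP request and does not send it. The base URL http://localhost:30080, the model name facebook/opt-125m, and the helper build_completion_request are assumptions for illustration; substitute the address of your router service and the model you actually deployed.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and model name; adjust both to your deployment.
req = build_completion_request(
    "http://localhost:30080", "facebook/opt-125m", "Once upon a time,"
)

# Sending the request requires a reachable deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing it at a live cluster.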

Maintenance & Community

  • Weekly community meetings are held.
  • Active development with a Q1 2025 roadmap including autoscaling and router improvements.
  • Community support is available via a Slack channel.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is under active development. Features such as session-ID-based routing and a more performant router written in a language other than Python are marked as "Work In Progress" (WIP).

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 13
  • Issues (30d): 3
  • Star history: 80 stars in the last 30 days
