production-stack by vllm-project

Reference stack for production vLLM deployment on Kubernetes

created 6 months ago
1,584 stars

Top 26.9% on sourcepulse

Project Summary

This project provides a Kubernetes-native reference implementation for deploying vLLM, targeting users who need to scale inference workloads from a single instance to a distributed deployment. It offers enhanced observability through a web dashboard and performance gains via intelligent request routing and KV cache offloading.

How It Works

The stack is deployed with Helm and comprises a vLLM serving engine, a request router, and an observability stack (Prometheus + Grafana). The router directs requests to the appropriate vLLM backends, maximizing KV cache reuse, and supports multiple routing strategies. The observability stack tracks key metrics such as request latency, time-to-first-token (TTFT), and KV cache utilization, surfacing them on a Grafana dashboard.
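
Once the chart is installed, the router and the Grafana dashboard are ordinary Kubernetes services, so one quick way to reach them locally is kubectl port-forward. A minimal sketch, assuming service names similar to what the chart typically creates (verify with kubectl get svc, since the actual names depend on your release name and namespace):

  # Expose the request router's OpenAI-compatible endpoint on localhost:30080.
  # "vllm-router-service" is an assumed service name; confirm it with
  # kubectl get svc in the release namespace.
  kubectl port-forward svc/vllm-router-service 30080:80

  # In another terminal, expose Grafana to browse the dashboard. The service
  # name and namespace depend on how the observability stack was installed.
  kubectl -n monitoring port-forward svc/kube-prom-stack-grafana 3000:80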

Quick Start & Requirements

  • Install: Clone the repo, then run helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml (see the sketch after this list).
  • Prerequisites: A running Kubernetes (K8s) environment with Helm installed.
  • Resources: A K8s cluster with GPU nodes for the serving engines.
  • Docs: Official documentation
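
A slightly fuller version of the install step, as a sketch: the Helm repository URL is assumed from the project's GitHub Pages and should be checked against the official documentation, and the values file path assumes the command is run from the cloned repository root.

  # Clone the repository so the example values files are available locally.
  git clone https://github.com/vllm-project/production-stack.git
  cd production-stack

  # Add the Helm repository (URL assumed; verify against the official docs).
  helm repo add vllm https://vllm-project.github.io/production-stack
  helm repo update

  # Deploy the minimal example: one serving engine plus the request router.
  helm install vllm vllm/vllm-stack \
    -f tutorials/assets/values-01-minimal-example.yaml

  # Watch the pods until the router and serving engine reach Running.
  kubectl get pods -w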

Highlighted Details

  • Provides an OpenAI-compatible API for deployed models (see the request example after this list).
  • Supports automatic service discovery and fault tolerance via Kubernetes.
  • Offers routing based on session IDs and prefix awareness (WIP).
  • Grafana dashboard monitors instance health, request latency, TTFT, and GPU KV cache usage.
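
As a sketch of the OpenAI-compatible interface, assuming the router has been port-forwarded to localhost:30080 as in the earlier example and that the deployment serves facebook/opt-125m (substitute whatever model your values file actually deploys):

  # List the models the router currently exposes.
  curl http://localhost:30080/v1/models

  # Send a completion request; swap the model name for the one you deployed.
  curl http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "facebook/opt-125m",
          "prompt": "Kubernetes is",
          "max_tokens": 32
        }'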

Maintenance & Community

  • Weekly community meetings are held.
  • Active development with a Q1 2025 roadmap including autoscaling and router improvements.
  • Community support is available via a Slack channel.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project is under active development; features such as session-ID-based routing and a higher-performance router implementation in a language other than Python are noted as work in progress (WIP).

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 50
  • Issues (30d): 17
Star History
454 stars in the last 90 days

Explore Similar Projects

Starred by Eugene Yan (AI Scientist at AWS), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

seldon-core by SeldonIO

MLOps framework for production model deployment on Kubernetes
Top 0.1% · 5k stars
created 7 years ago
updated 1 day ago