vllm-playground by micytao

Modern web UI for vLLM LLM serving

Created 2 months ago
304 stars

Top 88.2% on SourcePulse

View on GitHub
Project Summary

vLLM Playground offers a modern, web-based interface for managing and interacting with vLLM inference servers. It targets engineers and researchers needing a streamlined way to deploy and test LLMs, providing automatic container management for local development and enterprise-grade orchestration for Kubernetes/OpenShift environments. The project simplifies vLLM setup, supports both GPU and CPU modes, and includes optimizations for macOS Apple Silicon.

How It Works

The project employs a hybrid architecture built around a FastAPI backend. For local development it uses Podman for container orchestration, automatically managing the vLLM service lifecycle; in enterprise settings it uses the Kubernetes API to dynamically create and manage vLLM pods. This design gives a consistent user experience across local and cloud deployments, with intelligent hardware detection (notably GPU availability via the Kubernetes API) and seamless switching between environments.
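
As a rough illustration of that hybrid design, the sketch below shows how a backend could be selected at startup. It is a minimal, hypothetical example, not code from vllm-playground: the function name, the in-cluster token path, and the Podman CLI probe are all assumptions.

```python
# Hypothetical sketch of hybrid backend selection (not vllm-playground's actual code).
import os
import shutil


def select_backend() -> str:
    """Pick Kubernetes when running in-cluster, otherwise fall back to local Podman."""
    # Kubernetes mounts a service-account token into every pod at this well-known path.
    if os.path.exists("/var/run/secrets/kubernetes.io/serviceaccount/token"):
        return "kubernetes"
    # Outside a cluster, use Podman if its CLI is available on PATH.
    if shutil.which("podman"):
        return "podman"
    raise RuntimeError("No supported backend found: need a Kubernetes cluster or Podman")


print(f"Selected backend: {select_backend()}")
```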

Quick Start & Requirements

  • PyPI Install: pip install vllm-playground, then launch with the vllm-playground command (see the request sketch after this list).
  • Container Orchestration (Source): Clone the repo, install Podman, run pip install -r requirements.txt, then python run.py.
  • OpenShift/Kubernetes: Build the UI container, then deploy with the ./deploy.sh --gpu or ./deploy.sh --cpu script in the openshift/ directory.
  • Prerequisites: Python, Podman (local containers), Kubernetes/OpenShift cluster (enterprise), HuggingFace token (for gated models like Llama/Gemma). GPU hardware is auto-detected.
  • Documentation: Quick Start Guide (docs/QUICKSTART.md), OpenShift Deployment (openshift/QUICK_START.md), macOS CPU Guide (docs/MACOS_CPU_GUIDE.md).
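
Once the playground has a vLLM server running, it can be queried like any other vLLM instance. The snippet below is a minimal sketch assuming the server is reachable at http://localhost:8000 and exposes vLLM's standard OpenAI-compatible API; the address and model name are placeholders, not values documented by the project.

```python
# Minimal sketch: query a running vLLM server via its OpenAI-compatible API.
# Assumes http://localhost:8000 and a placeholder model name; adjust to your deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # placeholder; use the model you loaded in the UI
        "prompt": "Explain what vLLM is in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```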

Highlighted Details

  • Container Orchestration: Automatic vLLM container lifecycle management via Podman (local) or Kubernetes API (cloud).
  • Enterprise Deployment: Production-ready OpenShift/Kubernetes integration with dynamic pod creation and RBAC security.
  • macOS Optimization: Dedicated support for Apple Silicon via containerized CPU mode.
  • GuideLLM Benchmarking: Integrated load testing for performance analysis (throughput, latency).
  • vLLM Community Recipes: One-click model configuration loading, synced from official vLLM recipes.
  • Intelligent Hardware Detection: Automatic GPU detection via the Kubernetes API, so the UI only offers GPU mode when GPUs are actually available (sketched after this list).
  • Gated Model Access: Built-in support for HuggingFace tokens for restricted models.
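
The GPU auto-detection called out above can be approximated with the official kubernetes Python client. This is a hedged sketch of the general technique (checking allocatable nvidia.com/gpu resources on cluster nodes), not the project's actual detection logic.

```python
# Hypothetical sketch of GPU detection via the Kubernetes API
# (not vllm-playground's implementation).
from kubernetes import client, config


def cluster_has_gpus() -> bool:
    """Return True if any node advertises allocatable NVIDIA GPUs."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    nodes = client.CoreV1Api().list_node().items
    return any(
        int(node.status.allocatable.get("nvidia.com/gpu", "0")) > 0
        for node in nodes
    )


print("GPU mode available:", cluster_has_gpus())
```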

Maintenance & Community

No specific details on maintainers, community channels (e.g., Discord, Slack), or active development signals were found in the provided README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and modification.

Limitations & Caveats

Accessing gated models requires a HuggingFace token. CPU-only inference can be slow for larger models. GuideLLM benchmarks may need significant memory (roughly 16Gi+ for GPU runs and 64Gi+ for CPU runs). On macOS, running CPU mode inside a container is the recommended setup.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 2

Star History

  • 149 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

production-stack by vllm-project

Reference stack for production vLLM deployment on Kubernetes

2k stars
Top 0.9% on SourcePulse
Created 11 months ago
Updated 4 days ago