vllm-playground by micytao

Modern web UI for vLLM LLM serving

Created 2 months ago
304 stars

Top 88.2% on SourcePulse

View on GitHub
Project Summary

vLLM Playground offers a modern, web-based interface for managing and interacting with vLLM inference servers. It targets engineers and researchers needing a streamlined way to deploy and test LLMs, providing automatic container management for local development and enterprise-grade orchestration for Kubernetes/OpenShift environments. The project simplifies vLLM setup, supports both GPU and CPU modes, and includes optimizations for macOS Apple Silicon.

How It Works

The project employs a hybrid architecture built around a FastAPI backend. For local development it uses Podman for container orchestration, automatically managing the vLLM service lifecycle; in enterprise settings it uses the Kubernetes API to dynamically create and manage vLLM pods. This design gives a consistent user experience across local and cloud deployments, with intelligent hardware detection (notably GPU availability via the Kubernetes API) and seamless switching between environments.
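
As a rough illustration of that hybrid design, the sketch below shows how a backend could be selected at startup. It is a minimal, hypothetical example, not code from vllm-playground: the function name, the in-cluster token path, and the Podman CLI probe are all assumptions.

```python
# Hypothetical sketch of hybrid backend selection (not vllm-playground's actual code).
import os
import shutil


def select_backend() -> str:
    """Pick Kubernetes when running in-cluster, otherwise fall back to local Podman."""
    # Kubernetes mounts a service-account token into every pod at this well-known path.
    if os.path.exists("/var/run/secrets/kubernetes.io/serviceaccount/token"):
        return "kubernetes"
    # Outside a cluster, use Podman if its CLI is available on PATH.
    if shutil.which("podman"):
        return "podman"
    raise RuntimeError("No supported backend found: need a Kubernetes cluster or Podman")


print(f"Selected backend: {select_backend()}")
```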

Quick Start & Requirements

  • PyPI Install: pip install vllm-playground, then launch with the vllm-playground command (see the request sketch after this list).
  • Container Orchestration (Source): Clone the repo, install Podman, run pip install -r requirements.txt, then python run.py.
  • OpenShift/Kubernetes: Build the UI container, then deploy with the ./deploy.sh --gpu or ./deploy.sh --cpu script in the openshift/ directory.
  • Prerequisites: Python, Podman (local containers), Kubernetes/OpenShift cluster (enterprise), HuggingFace token (for gated models like Llama/Gemma). GPU hardware is auto-detected.
  • Documentation: Quick Start Guide (docs/QUICKSTART.md), OpenShift Deployment (openshift/QUICK_START.md), macOS CPU Guide (docs/MACOS_CPU_GUIDE.md).
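
Once the playground has a vLLM server running, it can be queried like any other vLLM instance. The snippet below is a minimal sketch assuming the server is reachable at http://localhost:8000 and exposes vLLM's standard OpenAI-compatible API; the address and model name are placeholders, not values documented by the project.

```python
# Minimal sketch: query a running vLLM server via its OpenAI-compatible API.
# Assumes http://localhost:8000 and a placeholder model name; adjust to your deployment.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # placeholder; use the model you loaded in the UI
        "prompt": "Explain what vLLM is in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```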

Highlighted Details

  • Container Orchestration: Automatic vLLM container lifecycle management via Podman (local) or Kubernetes API (cloud).
  • Enterprise Deployment: Production-ready OpenShift/Kubernetes integration with dynamic pod creation and RBAC security.
  • macOS Optimization: Dedicated support for Apple Silicon via containerized CPU mode.
  • GuideLLM Benchmarking: Integrated load testing for performance analysis (throughput, latency).
  • vLLM Community Recipes: One-click model configuration loading, synced from official vLLM recipes.
  • Intelligent Hardware Detection: Automatic GPU detection via the Kubernetes API, so the UI only offers GPU mode when GPUs are actually available (sketched after this list).
  • Gated Model Access: Built-in support for HuggingFace tokens for restricted models.
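
The GPU auto-detection called out above can be approximated with the official kubernetes Python client. This is a hedged sketch of the general technique (checking allocatable nvidia.com/gpu resources on cluster nodes), not the project's actual detection logic.

```python
# Hypothetical sketch of GPU detection via the Kubernetes API
# (not vllm-playground's implementation).
from kubernetes import client, config


def cluster_has_gpus() -> bool:
    """Return True if any node advertises allocatable NVIDIA GPUs."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    nodes = client.CoreV1Api().list_node().items
    return any(
        int(node.status.allocatable.get("nvidia.com/gpu", "0")) > 0
        for node in nodes
    )


print("GPU mode available:", cluster_has_gpus())
```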

Maintenance & Community

No specific details on maintainers, community channels (e.g., Discord, Slack), or active development signals were found in the provided README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and modification.

Limitations & Caveats

Accessing gated models requires a HuggingFace token. CPU-only inference can be slow for larger models. GuideLLM benchmarks may need significant memory (roughly 16Gi+ for GPU runs and 64Gi+ for CPU runs). On macOS, running CPU mode inside a container is the recommended setup.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 2

Star History

  • 149 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

production-stack by vllm-project

Reference stack for production vLLM deployment on Kubernetes

2k stars
Top 0.9% on SourcePulse
Created 11 months ago
Updated 4 days ago