worker-vllm by runpod-workers

RunPod worker template for blazing-fast LLM endpoints

created 2 years ago · 337 stars · Top 82.8% on sourcepulse

View on GitHub
Project Summary

This project provides a RunPod worker template for deploying large language model (LLM) endpoints using the vLLM inference engine. It targets developers and researchers who need to serve LLMs efficiently and offers OpenAI-compatible API endpoints for seamless integration with existing applications.
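Because the API is OpenAI-compatible, the official openai Python client can point at a deployed endpoint directly. The sketch below is illustrative only: ENDPOINT_ID, the API key, and the model name are placeholders, and the /openai/v1 base path should be verified against RunPod's documentation.

    # Minimal sketch: a chat completion against a deployed worker-vllm endpoint.
    # ENDPOINT_ID, the API key, and the model name are placeholders.
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_RUNPOD_API_KEY",
        base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    )

    response = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # whichever model the worker serves
        messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)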

How It Works

The worker leverages vLLM's optimized inference capabilities, including PagedAttention for efficient memory management and continuous batching for high throughput. It supports a wide range of Hugging Face-compatible model architectures and can be configured via environment variables or by building a custom Docker image with the model baked in. This approach allows for flexible deployment and fine-grained control over serving parameters.
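As an illustration of the environment-variable approach, the Python mapping below lists a few representative settings. The variable names follow the README's conventions, but exact names, accepted values, and defaults should be taken from the project's documentation.

    # Representative worker configuration, expressed as a Python mapping.
    # Names and values are examples only; consult the README for the full list.
    worker_env = {
        "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model to serve
        "QUANTIZATION": "awq",        # optional weight-quantization scheme
        "TENSOR_PARALLEL_SIZE": "2",  # shard the model across two GPUs
        "MAX_MODEL_LEN": "8192",      # cap context length to bound KV-cache memory
    }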

Quick Start & Requirements

  • Install/Run: Deploy via a RunPod Serverless Endpoint using the pre-built Docker images (e.g., runpod/worker-v1-vllm:v2.4.0stable-cuda12.1.0); see the invocation sketch after this list.
  • Prerequisites: a RunPod account; CUDA 12.1.0+ is recommended.
  • Setup: near-instant deployment thanks to image caching.
  • Docs: RunPod Serverless Worker vLLM
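Once deployed, the endpoint can also be invoked through the worker's native handler rather than the OpenAI-compatible route. A minimal sketch using the runpod Python SDK follows; ENDPOINT_ID is a placeholder, and the input schema (prompt plus sampling_params) should be verified against the README.

    # Native (non-OpenAI) invocation via the runpod SDK.
    # ENDPOINT_ID is a placeholder; verify the input schema against the README.
    import runpod

    runpod.api_key = "YOUR_RUNPOD_API_KEY"
    endpoint = runpod.Endpoint("ENDPOINT_ID")

    result = endpoint.run_sync(
        {
            "prompt": "Explain continuous batching in one paragraph.",
            "sampling_params": {"temperature": 0.7, "max_tokens": 150},
        },
        timeout=120,  # seconds to wait for the synchronous result
    )
    print(result)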

Highlighted Details

  • OpenAI-compatible API for Chat Completions and Models.
  • Supports numerous LLM architectures including Llama, Mistral, Mixtral, Qwen, and more.
  • Extensive configuration options for quantization, tensor parallelism, batching, and sampling.
  • Image caching across RunPod machines for rapid deployment.

Maintenance & Community

Maintained by RunPod. Community support channels are not explicitly listed in the README.

Licensing & Compatibility

The project itself appears to be open-source, but the underlying vLLM library carries its own license (Apache-2.0). Whether a deployment can be used commercially also depends on the license of the specific LLM model served.

Limitations & Caveats

The README notes that the logit_bias and user parameters of the OpenAI API are not supported, as vLLM does not implement them. Some advanced configurations or less common model architectures may require building a custom Docker image.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 5
  • Issues (30d): 6
  • Star history: 33 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

Inference framework for distributed generative AI model serving
Top 1.1% · 5k stars · created 5 months ago · updated 22 hours ago