vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

Created 2 years ago
58,308 stars

Top 0.4% on SourcePulse

Project Summary

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs). It targets developers and researchers who need to deploy LLMs efficiently, delivering significant speedups and cost reductions through advanced memory management and optimized execution.

How It Works

vLLM employs PagedAttention, a novel memory management technique inspired by virtual memory paging, to efficiently manage the attention key and value (KV) cache. This allows for continuous batching of requests and reduces memory fragmentation, leading to higher throughput and lower latency compared to traditional methods. It also leverages CUDA/HIP graphs and optimized kernels like FlashAttention for faster model execution.
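
As a rough intuition for the paging idea, the toy sketch below shows the kind of block-table bookkeeping paged KV caching implies: cache memory is carved into fixed-size blocks, each sequence maps logical blocks to physical ones on demand, and finished sequences return their blocks to a shared pool. The class, names, and block size are illustrative, not vLLM's internals.

    BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

    class BlockTable:
        """Toy model of paged KV-cache bookkeeping; not vLLM's code."""

        def __init__(self, num_physical_blocks: int):
            self.free = list(range(num_physical_blocks))  # pool of physical block ids
            self.tables: dict[int, list[int]] = {}        # seq_id -> physical block ids

        def grow_to(self, seq_id: int, total_tokens: int) -> None:
            table = self.tables.setdefault(seq_id, [])
            # Allocate a new physical block only when the current ones fill up,
            # so memory tracks actual sequence length rather than a padded maximum.
            while len(table) * BLOCK_SIZE < total_tokens:
                table.append(self.free.pop())

        def release(self, seq_id: int) -> None:
            # Finished sequences return blocks to the pool immediately, which is
            # what keeps fragmentation low under continuous batching.
            self.free.extend(self.tables.pop(seq_id, []))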

Quick Start & Requirements
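
vLLM installs from PyPI with pip install vllm; the optimized kernels assume a recent NVIDIA GPU with CUDA, though other backends (e.g., AMD ROCm) are supported. A minimal offline-inference sketch using vLLM's documented LLM API (the model name is just an example):

    from vllm import LLM, SamplingParams

    # Load any supported Hugging Face model; opt-125m is a small example.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["The capital of France is"], params)
    print(outputs[0].outputs[0].text)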

Highlighted Details

  • Achieves state-of-the-art serving throughput.
  • Supports a wide range of LLMs, including Mixture-of-Experts and multimodal models.
  • Offers an OpenAI-compatible API server (see the client sketch after this list).
  • Features continuous batching, speculative decoding, and prefix caching.
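
Because the server speaks the OpenAI API, any OpenAI client should work against it; a sketch, assuming a locally served model (model name and default port are illustrative):

    # Start the server first, e.g.: vllm serve facebook/opt-125m
    # By default it listens on http://localhost:8000/v1.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key unused locally
    resp = client.completions.create(
        model="facebook/opt-125m",
        prompt="San Francisco is a",
        max_tokens=32,
    )
    print(resp.choices[0].text)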

Maintenance & Community

  • Active community with regular meetups and contributions from major tech companies (Meta, NVIDIA, Google Cloud, AWS).
  • Developer Slack available at slack.vllm.ai.
  • Significant industry sponsorship from a16z, Dropbox, and others.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

  • vLLM supports a broad range of hardware, but optimal performance is typically achieved on NVIDIA GPUs.
  • The alpha release of vLLM V1 (Jan 2025) indicates ongoing architectural changes and potential for breaking updates.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1,602
  • Issues (30d): 976
  • Star History: 2,746 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

Top 0.3% on SourcePulse
6k stars
LLM inference engine for blazing-fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

Top 0.6% on SourcePulse
20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago