vllm  by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

created 2 years ago
53,610 stars

Top 0.5% on sourcepulse

Project Summary

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs). It targets developers and researchers needing to deploy LLMs efficiently, offering significant speedups and cost reductions through advanced memory management and optimized execution.

How It Works

vLLM employs PagedAttention, a memory-management technique inspired by virtual-memory paging in operating systems, to manage the attention key and value (KV) cache efficiently. Storing the cache in small, non-contiguous blocks enables continuous batching of requests and sharply reduces memory fragmentation, yielding higher throughput and lower latency than engines that allocate each sequence's KV cache contiguously. vLLM also leverages CUDA/HIP graphs and optimized kernels such as FlashAttention for faster model execution.
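The bookkeeping behind PagedAttention can be illustrated with a toy sketch: each sequence gets a block table mapping its logical token positions to fixed-size physical blocks, allocated one at a time as tokens arrive. The names below (`PagedKVCache`, `BLOCK_SIZE`) are invented for illustration and are not vLLM's internals.

```python
BLOCK_SIZE = 16  # tokens stored per physical cache block


class PagedKVCache:
    """Toy block-table bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens cached

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one new token; a fresh block is taken
        only when the current one fills, so no large contiguous region
        is ever required and fragmentation stays within one block."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or no block yet)
            if not self.free_blocks:
                raise MemoryError("cache exhausted; a sequence must be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(20):            # 20 tokens need ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # -> 2
cache.free(0)
print(len(cache.free_blocks))      # -> 4, all blocks reclaimed
```

Because blocks are handed back the moment a request finishes, another waiting request can be admitted immediately, which is what makes continuous batching practical.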

Quick Start & Requirements
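The listing gives no quick-start steps; a minimal sketch, assuming a Linux host with a supported NVIDIA GPU and a recent Python (the model name is only an example):

```shell
# Install vLLM (pulls in CUDA-enabled PyTorch wheels on supported platforms)
pip install vllm

# Launch the OpenAI-compatible API server (model name is an example)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# In another terminal, query the standard OpenAI chat-completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```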

Highlighted Details

  • Achieves state-of-the-art serving throughput.
  • Supports a wide range of LLMs, including Mixture-of-Experts and multimodal models.
  • Offers an OpenAI-compatible API server.
  • Features continuous batching, speculative decoding, and prefix caching.
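Prefix caching, the last feature above, can be sketched as content-addressed block reuse: full blocks of prompt tokens are keyed by a hash chained through the preceding block, so requests sharing a prompt prefix hit the same cached blocks instead of recomputing them. All names here (`PrefixCache`, `BLOCK`) are invented for illustration, not vLLM's internals.

```python
from hashlib import sha256

BLOCK = 4  # tokens per block (small for demonstration)


def block_hashes(tokens):
    """Hash each *full* block, chaining in the previous block's hash so a
    block's key encodes its entire prefix, not just its own tokens."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.store = {}  # block hash -> cached KV block (stubbed as True)

    def insert(self, tokens):
        for h in block_hashes(tokens):
            self.store.setdefault(h, True)

    def lookup(self, tokens):
        """Return how many leading prompt tokens are already cached."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.store:
                break
            hit += BLOCK
        return hit


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # shared prefix: 2 full blocks
cache.insert(system_prompt + [9, 10, 11, 12])      # first request fills the cache
print(cache.lookup(system_prompt + [20, 21, 22, 23]))  # -> 8 tokens reused
```

A second request with the same system prompt but a different user turn skips recomputation for the shared 8 tokens; only the divergent tail is processed.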

Maintenance & Community

  • Active community with regular meetups and contributions from major tech companies (Meta, NVIDIA, Google Cloud, AWS).
  • Developer Slack available at slack.vllm.ai.
  • Significant industry sponsorship from a16z, Dropbox, and others.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

  • While supporting a broad range of hardware, optimal performance is typically achieved on NVIDIA GPUs.
  • The alpha release of vLLM V1 (Jan 2025) indicates ongoing architectural changes and potential for breaking updates.
Health Check
Last commit

10 hours ago

Responsiveness

1 day

Pull Requests (30d)
1,223
Issues (30d)
917
Star History
7,590 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

LightLLM by ModelTC

0.7%
3k
Python framework for LLM inference and serving
created 2 years ago
updated 11 hours ago
Starred by Lewis Tunstall (Researcher at Hugging Face), Robert Nishihara (Cofounder of Anyscale; Author of Ray), and 4 more.

verl by volcengine

2.4%
12k
RL training library for LLMs
created 9 months ago
updated 10 hours ago