vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

Created 2 years ago
58,308 stars

Top 0.4% on SourcePulse

Project Summary

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs). It targets developers and researchers who need to deploy LLMs efficiently, delivering significant speedups and cost reductions through advanced memory management and optimized execution.

How It Works

vLLM employs PagedAttention, a novel memory management technique inspired by virtual memory paging, to efficiently manage the attention key and value (KV) cache. This allows for continuous batching of requests and reduces memory fragmentation, leading to higher throughput and lower latency compared to traditional methods. It also leverages CUDA/HIP graphs and optimized kernels like FlashAttention for faster model execution.
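
As a rough intuition for the paging idea, the toy sketch below shows the kind of block-table bookkeeping paged KV caching implies: cache memory is carved into fixed-size blocks, each sequence maps logical blocks to physical ones on demand, and finished sequences return their blocks to a shared pool. The class, names, and block size are illustrative, not vLLM's internals.

    BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

    class BlockTable:
        """Toy model of paged KV-cache bookkeeping; not vLLM's code."""

        def __init__(self, num_physical_blocks: int):
            self.free = list(range(num_physical_blocks))  # pool of physical block ids
            self.tables: dict[int, list[int]] = {}        # seq_id -> physical block ids

        def grow_to(self, seq_id: int, total_tokens: int) -> None:
            table = self.tables.setdefault(seq_id, [])
            # Allocate a new physical block only when the current ones fill up,
            # so memory tracks actual sequence length rather than a padded maximum.
            while len(table) * BLOCK_SIZE < total_tokens:
                table.append(self.free.pop())

        def release(self, seq_id: int) -> None:
            # Finished sequences return blocks to the pool immediately, which is
            # what keeps fragmentation low under continuous batching.
            self.free.extend(self.tables.pop(seq_id, []))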

Quick Start & Requirements
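
vLLM installs from PyPI with pip install vllm; the optimized kernels assume a recent NVIDIA GPU with CUDA, though other backends (e.g., AMD ROCm) are supported. A minimal offline-inference sketch using vLLM's documented LLM API (the model name is just an example):

    from vllm import LLM, SamplingParams

    # Load any supported Hugging Face model; opt-125m is a small example.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["The capital of France is"], params)
    print(outputs[0].outputs[0].text)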

Highlighted Details

  • Achieves state-of-the-art serving throughput.
  • Supports a wide range of LLMs, including Mixture-of-Experts and multimodal models.
  • Offers an OpenAI-compatible API server (see the client sketch after this list).
  • Features continuous batching, speculative decoding, and prefix caching.
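
Because the server speaks the OpenAI API, any OpenAI client should work against it; a sketch, assuming a locally served model (model name and default port are illustrative):

    # Start the server first, e.g.: vllm serve facebook/opt-125m
    # By default it listens on http://localhost:8000/v1.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key unused locally
    resp = client.completions.create(
        model="facebook/opt-125m",
        prompt="San Francisco is a",
        max_tokens=32,
    )
    print(resp.choices[0].text)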

Maintenance & Community

  • Active community with regular meetups and contributions from major tech companies (Meta, NVIDIA, Google Cloud, AWS).
  • Developer Slack available at slack.vllm.ai.
  • Significant industry sponsorship from a16z, Dropbox, and others.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

  • vLLM supports a broad range of hardware, but optimal performance is typically achieved on NVIDIA GPUs.
  • The alpha release of vLLM V1 (Jan 2025) indicates ongoing architectural changes and potential for breaking updates.

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1,602
  • Issues (30d): 976
  • Star History: 2,746 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

Top 0.3% on SourcePulse
6k stars
LLM inference engine for blazing-fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

Top 0.6% on SourcePulse
20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago