vllm-turboquant  by mitkox

LLM inference and serving library optimized for speed and efficiency

Created 2 weeks ago

488 stars

Top 63.1% on SourcePulse

Project Summary

Summary

vLLM provides a high-throughput, efficient, and user-friendly library for serving Large Language Models (LLMs). It targets developers and researchers needing to deploy LLMs cost-effectively and with minimal latency. The core benefit is significantly improved serving performance and memory management, making LLM deployment more accessible.

How It Works

vLLM employs several key innovations for speed and efficiency. PagedAttention is central, enabling state-of-the-art throughput by efficiently managing the attention key-value memory. This is combined with continuous batching of incoming requests to maximize GPU utilization. Fast model execution is achieved through CUDA/HIP graph optimizations, integration with FlashAttention/FlashInfer, and support for various quantization formats (GPTQ, AWQ, INT4, INT8, FP8). Speculative decoding and chunked prefill further enhance inference speed.
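
To make the block-table idea concrete, here is a toy sketch in Python — not vLLM's actual implementation, just an illustration of how a paged KV cache maps a sequence's logical token positions onto fixed-size physical blocks allocated on demand (`BLOCK_SIZE`, `ToyBlockManager`, and the request ids are all hypothetical):

```python
# Toy sketch of the block-table idea behind PagedAttention (illustrative only,
# not vLLM's actual implementation): the KV cache is carved into fixed-size
# blocks, and each sequence keeps a table mapping logical blocks to physical
# ones, so memory is claimed on demand with no large contiguous reservation.

BLOCK_SIZE = 4  # tokens per KV-cache block (hypothetical value)

class ToyBlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of physical ids

    def append_token(self, seq_id: str, pos: int) -> int:
        """Ensure a physical block backs logical position `pos`; return its id."""
        table = self.tables.setdefault(seq_id, [])
        logical_block = pos // BLOCK_SIZE
        if logical_block == len(table):       # ran off the last block: allocate
            table.append(self.free.pop())
        return table[logical_block]

    def free_seq(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

mgr = ToyBlockManager(num_blocks=8)
for pos in range(6):                          # 6 tokens span 2 blocks of 4
    mgr.append_token("req-0", pos)
print(len(mgr.tables["req-0"]))               # -> 2
mgr.free_seq("req-0")
print(len(mgr.free))                          # -> 8
```

Because blocks are claimed only as tokens arrive and returned when a request finishes, over-reservation is avoided, which is what lets continuous batching pack many requests onto one GPU.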

Quick Start & Requirements

Installation is straightforward via pip: pip install vllm. The project supports a wide array of hardware, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, TPUs, and specialized accelerators such as Intel Gaudi and Huawei Ascend. Required CUDA/HIP versions and dataset sizes are not detailed in the provided text. Official documentation is available at vllm.ai.
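
For orientation, offline inference with the installed package looks roughly like the following — a minimal sketch assuming a supported accelerator is available; the model name and sampling settings are illustrative, not a verified recipe:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Assumes `pip install vllm` succeeded and supported hardware is present;
# the model name below is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")          # downloads weights on first use
params = SamplingParams(temperature=0.8, max_tokens=16)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```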

Highlighted Details

  • PagedAttention for optimized KV cache management.
  • Continuous batching for high throughput.
  • Extensive quantization support (GPTQ, AWQ, INT4, INT8, FP8).
  • Broad hardware compatibility across major vendors and accelerators.
  • OpenAI-compatible API server for easy integration.
  • Support for Transformer-like, MoE, Embedding, and Multi-modal LLMs.
  • Multi-LoRA support.

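Since the list above highlights the OpenAI-compatible API server, here is a sketch of the request shape such a server accepts at /v1/chat/completions — the field names follow the OpenAI API that vLLM mirrors, while the model name and parameter values are illustrative:

```python
# Build an OpenAI-style chat-completions payload (pure-stdlib sketch).
# The endpoint path and field names follow the OpenAI API that vLLM's server
# mirrors; the model name and parameter values are illustrative.
import json

payload = {
    "model": "facebook/opt-125m",             # must match the served model
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
    "temperature": 0.7,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any OpenAI client.
print(json.loads(body)["model"])              # -> facebook/opt-125m
```

Because the wire format is unchanged, existing OpenAI client libraries can be pointed at a vLLM server by swapping the base URL.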
Maintenance & Community

vLLM is a community-driven project welcoming contributions. Technical discussions and feature requests are handled via GitHub Issues. User discussions occur on the vLLM Forum, and development coordination happens on Slack. Collaborations can be initiated via collaboration@vllm.ai.

Licensing & Compatibility

The specific open-source license for this repository was not detailed in the provided README content. Compatibility for commercial use or closed-source linking cannot be determined without this information.

Limitations & Caveats

The project reports vLLM version 0.18.1rc1, a release-candidate version, which may imply potential instability or incomplete features. No other specific limitations were mentioned in the provided text.

Health Check
Last Commit

5 days ago

Responsiveness

Inactive

Pull Requests (30d)
5
Issues (30d)
2
Star History
489 stars in the last 17 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.1%
929
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 3 days ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

0.1%
4k
AI inference pipeline framework
Created 2 years ago
Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 61 more.

vllm by vllm-project

1.2%
76k
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 20 hours ago