candle-vllm by EricLBuehler

Platform for local LLM inference and serving with OpenAI API compatibility

Created 1 year ago
462 stars

Top 65.6% on SourcePulse

Project Summary

This project provides an efficient, easy-to-use platform for serving local Large Language Models (LLMs) with an OpenAI-compatible API server. It targets developers and researchers needing to deploy and interact with LLMs locally, offering features like continuous batching, PagedAttention, and support for various quantization formats for optimized performance.

How It Works

candle-vllm leverages Rust and the Candle library for high-performance LLM inference. Its core design emphasizes efficiency through PagedAttention for key-value cache management and continuous batching to maximize GPU utilization. The platform's extensible trait-based system allows for rapid integration of new model architectures and processing pipelines, with built-in support for various quantization methods like GPTQ and Marlin for reduced memory footprint and faster inference.
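For intuition, here is a minimal, illustrative Python sketch of how a paged KV cache and continuous batching fit together: each sequence maps logical token positions to fixed-size physical cache blocks through a block table, and the scheduler admits waiting requests between decode steps as blocks free up. The block size, class names, and admission policy below are assumptions for illustration only, not candle-vllm's actual implementation.

```python
# Conceptual sketch (not candle-vllm's code) of PagedAttention-style KV-cache
# management combined with continuous batching.
from collections import deque

BLOCK_SIZE = 16  # tokens held per KV-cache block (illustrative value)


class BlockAllocator:
    """Hands out fixed-size cache blocks so a sequence never needs one large
    contiguous KV buffer; blocks freed by finished sequences are reused."""

    def __init__(self, num_blocks: int):
        self.free = deque(range(num_blocks))

    def allocate(self) -> int:
        return self.free.popleft()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


class Sequence:
    def __init__(self, seq_id: int, prompt_len: int, max_new_tokens: int):
        self.seq_id = seq_id
        self.tokens = prompt_len          # tokens whose KV entries must be cached
        self.remaining = max_new_tokens
        self.block_table: list[int] = []  # logical block index -> physical block id

    def blocks_needed(self) -> int:
        return -(-self.tokens // BLOCK_SIZE)  # ceiling division


def step(running: list[Sequence], waiting: deque, allocator: BlockAllocator) -> list[Sequence]:
    """One continuous-batching iteration: admit waiting requests while cache
    blocks remain, then decode one token for every running sequence."""
    while waiting and allocator.free:
        running.append(waiting.popleft())

    finished = []
    for seq in running:
        # Grow the block table lazily; only the last block is partially filled.
        while len(seq.block_table) < seq.blocks_needed():
            seq.block_table.append(allocator.allocate())
        # A real engine would gather this sequence's KV blocks via the block
        # table and run attention here; the sketch just advances the counters.
        seq.tokens += 1
        seq.remaining -= 1
        if seq.remaining == 0:
            allocator.release(seq.block_table)
            finished.append(seq)

    running[:] = [s for s in running if s.remaining > 0]
    return finished


# Toy driver: eight requests share a pool of 256 cache blocks.
allocator = BlockAllocator(num_blocks=256)
waiting = deque(Sequence(i, prompt_len=100, max_new_tokens=50) for i in range(8))
running: list[Sequence] = []
while waiting or running:
    for seq in step(running, waiting, allocator):
        print(f"sequence {seq.seq_id} finished")
```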

Quick Start & Requirements

  • Install: set up the Rust toolchain with `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`, then build and launch the server with `cargo run --release --features cuda -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama` (an example request against the running server follows this list).
  • Prerequisites: Rust compiler (1.83.0+), CUDA Toolkit (if using GPU acceleration), libssl-dev, pkg-config.
  • Setup: clone the repository and install the Rust toolchain.
  • Docs: Detailed Usage
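Once the server from the quick start is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using Python's requests library; it assumes the server listens on port 2000 (as in the command above) and exposes the standard /v1/chat/completions route, and the model name shown is a placeholder. Check the project's Detailed Usage docs for the exact request fields and model identifiers your deployment accepts.

```python
# Minimal sketch: chatting with a locally running candle-vllm server through
# its OpenAI-compatible API. Port, route, and model name are assumptions taken
# from the quick-start command and the standard OpenAI chat-completions shape.
import requests

resp = requests.post(
    "http://localhost:2000/v1/chat/completions",
    json={
        "model": "llama",  # placeholder; use the model id your server reports
        "messages": [
            {"role": "user", "content": "Explain PagedAttention in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```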

Highlighted Details

  • Supports multiple LLM architectures including Llama, Mistral, Phi, Qwen2, and Gemma.
  • Achieves high throughput with optimizations like PagedAttention and continuous batching.
  • Offers in-situ quantization for loading models into various GGML or Marlin formats on the fly (a conceptual sketch follows this list).
  • Provides Mac/Metal device support in addition to CUDA.
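The in-situ quantization mentioned above follows the general block-wise scheme used by GGML-style formats: weights are grouped into fixed-size blocks, and each block is stored as one floating-point scale plus low-bit integer codes, cutting memory roughly 4x versus fp16. The sketch below is a generic 4-bit illustration of that idea, not candle-vllm's or GGML's actual byte layout or kernels.

```python
# Illustrative block-wise 4-bit quantization (assumed, generic scheme; not the
# project's real format): one scale per block, codes clipped to [-8, 7].
import numpy as np

BLOCK = 32  # weights per quantization block (typical for GGML-style formats)


def quantize_block(w: np.ndarray) -> tuple[float, np.ndarray]:
    """Symmetric 4-bit quantization of one block."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return scale, q


def dequantize_block(scale: float, q: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale


weights = np.random.randn(4 * BLOCK).astype(np.float32)
blocks = [quantize_block(b) for b in weights.reshape(-1, BLOCK)]
restored = np.concatenate([dequantize_block(s, q) for s, q in blocks])
print("max abs error:", np.abs(weights - restored).max())
```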

Maintenance & Community

  • Actively developed by EricLBuehler.
  • Contributions are welcomed for features like beam search and more pipelines.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The project is under active development, with some model types and sampling methods (e.g., beam search) marked as "TBD" or planned for future implementation. Multi-GPU threaded mode may require disabling P2P communication via `export NCCL_P2P_DISABLE=1`.

Health Check

  • Last Commit: 13 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 18
  • Issues (30d): 23
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anil Dash (Former CEO of Glitch), and 23 more.

llamafile by Mozilla-Ocho

0.1%
23k
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago
Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 13 hours ago