candle-vllm by EricLBuehler

Platform for local LLM inference and serving with OpenAI API compatibility

Created 1 year ago
462 stars

Top 65.6% on SourcePulse

Project Summary

This project provides an efficient, easy-to-use platform for serving local Large Language Models (LLMs) with an OpenAI-compatible API server. It targets developers and researchers needing to deploy and interact with LLMs locally, offering features like continuous batching, PagedAttention, and support for various quantization formats for optimized performance.

How It Works

candle-vllm leverages Rust and the Candle library for high-performance LLM inference. Its core design emphasizes efficiency through PagedAttention for key-value cache management and continuous batching to maximize GPU utilization. The platform's extensible trait-based system allows for rapid integration of new model architectures and processing pipelines, with built-in support for various quantization methods like GPTQ and Marlin for reduced memory footprint and faster inference.
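For intuition, here is a minimal, illustrative Python sketch of how a paged KV cache and continuous batching fit together: each sequence maps logical token positions to fixed-size physical cache blocks through a block table, and the scheduler admits waiting requests between decode steps as blocks free up. The block size, class names, and admission policy below are assumptions for illustration only, not candle-vllm's actual implementation.

```python
# Conceptual sketch (not candle-vllm's code) of PagedAttention-style KV-cache
# management combined with continuous batching.
from collections import deque

BLOCK_SIZE = 16  # tokens held per KV-cache block (illustrative value)


class BlockAllocator:
    """Hands out fixed-size cache blocks so a sequence never needs one large
    contiguous KV buffer; blocks freed by finished sequences are reused."""

    def __init__(self, num_blocks: int):
        self.free = deque(range(num_blocks))

    def allocate(self) -> int:
        return self.free.popleft()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


class Sequence:
    def __init__(self, seq_id: int, prompt_len: int, max_new_tokens: int):
        self.seq_id = seq_id
        self.tokens = prompt_len          # tokens whose KV entries must be cached
        self.remaining = max_new_tokens
        self.block_table: list[int] = []  # logical block index -> physical block id

    def blocks_needed(self) -> int:
        return -(-self.tokens // BLOCK_SIZE)  # ceiling division


def step(running: list[Sequence], waiting: deque, allocator: BlockAllocator) -> list[Sequence]:
    """One continuous-batching iteration: admit waiting requests while cache
    blocks remain, then decode one token for every running sequence."""
    while waiting and allocator.free:
        running.append(waiting.popleft())

    finished = []
    for seq in running:
        # Grow the block table lazily; only the last block is partially filled.
        while len(seq.block_table) < seq.blocks_needed():
            seq.block_table.append(allocator.allocate())
        # A real engine would gather this sequence's KV blocks via the block
        # table and run attention here; the sketch just advances the counters.
        seq.tokens += 1
        seq.remaining -= 1
        if seq.remaining == 0:
            allocator.release(seq.block_table)
            finished.append(seq)

    running[:] = [s for s in running if s.remaining > 0]
    return finished


# Toy driver: eight requests share a pool of 256 cache blocks.
allocator = BlockAllocator(num_blocks=256)
waiting = deque(Sequence(i, prompt_len=100, max_new_tokens=50) for i in range(8))
running: list[Sequence] = []
while waiting or running:
    for seq in step(running, waiting, allocator):
        print(f"sequence {seq.seq_id} finished")
```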

Quick Start & Requirements

  • Install: set up the Rust toolchain with `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`, then build and launch the server with `cargo run --release --features cuda -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama` (an example request against the running server follows this list).
  • Prerequisites: Rust compiler (1.83.0+), CUDA Toolkit (if using GPU acceleration), libssl-dev, pkg-config.
  • Setup: clone the repository and install the Rust toolchain.
  • Docs: Detailed Usage
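Once the server from the quick start is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using Python's requests library; it assumes the server listens on port 2000 (as in the command above) and exposes the standard /v1/chat/completions route, and the model name shown is a placeholder. Check the project's Detailed Usage docs for the exact request fields and model identifiers your deployment accepts.

```python
# Minimal sketch: chatting with a locally running candle-vllm server through
# its OpenAI-compatible API. Port, route, and model name are assumptions taken
# from the quick-start command and the standard OpenAI chat-completions shape.
import requests

resp = requests.post(
    "http://localhost:2000/v1/chat/completions",
    json={
        "model": "llama",  # placeholder; use the model id your server reports
        "messages": [
            {"role": "user", "content": "Explain PagedAttention in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```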

Highlighted Details

  • Supports multiple LLM architectures including Llama, Mistral, Phi, Qwen2, and Gemma.
  • Achieves high throughput with optimizations like PagedAttention and continuous batching.
  • Offers in-situ quantization for loading models into various GGML or Marlin formats on the fly (a conceptual sketch follows this list).
  • Provides Mac/Metal device support in addition to CUDA.
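The in-situ quantization mentioned above follows the general block-wise scheme used by GGML-style formats: weights are grouped into fixed-size blocks, and each block is stored as one floating-point scale plus low-bit integer codes, cutting memory roughly 4x versus fp16. The sketch below is a generic 4-bit illustration of that idea, not candle-vllm's or GGML's actual byte layout or kernels.

```python
# Illustrative block-wise 4-bit quantization (assumed, generic scheme; not the
# project's real format): one scale per block, codes clipped to [-8, 7].
import numpy as np

BLOCK = 32  # weights per quantization block (typical for GGML-style formats)


def quantize_block(w: np.ndarray) -> tuple[float, np.ndarray]:
    """Symmetric 4-bit quantization of one block."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return scale, q


def dequantize_block(scale: float, q: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale


weights = np.random.randn(4 * BLOCK).astype(np.float32)
blocks = [quantize_block(b) for b in weights.reshape(-1, BLOCK)]
restored = np.concatenate([dequantize_block(s, q) for s, q in blocks])
print("max abs error:", np.abs(weights - restored).max())
```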

Maintenance & Community

  • Actively developed by EricLBuehler.
  • Contributions are welcomed for features like beam search and more pipelines.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The project is under active development, with some model types and sampling methods (e.g., beam search) marked as "TBD" or planned for future implementation. Multi-GPU threaded mode may require disabling P2P communication via `export NCCL_P2P_DISABLE=1`.

Health Check

  • Last Commit: 13 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 18
  • Issues (30d): 23
  • Star History: 35 stars in the last 30 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

1.0%
5k
Inference framework for distributed generative AI model serving
Created 6 months ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anil Dash (Former CEO of Glitch), and 23 more.

llamafile by Mozilla-Ocho

0.1%
23k
Single-file LLM distribution and runtime via `llama.cpp` and Cosmopolitan Libc
Created 2 years ago
Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 13 hours ago