Platform for local LLM inference and serving with OpenAI API compatibility
This project provides an efficient, easy-to-use platform for serving local Large Language Models (LLMs) with an OpenAI-compatible API server. It targets developers and researchers needing to deploy and interact with LLMs locally, offering features like continuous batching, PagedAttention, and support for various quantization formats for optimized performance.
How It Works
candle-vllm leverages Rust and the Candle library for high-performance LLM inference. Its core design emphasizes efficiency through PagedAttention for key-value cache management and continuous batching to maximize GPU utilization. The platform's extensible trait-based system allows for rapid integration of new model architectures and processing pipelines, with built-in support for various quantization methods like GPTQ and Marlin for reduced memory footprint and faster inference.
Quick Start & Requirements
Install the Rust toolchain:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then build and launch a model server (CUDA build shown):
cargo run --release --features cuda -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama
Building also requires libssl-dev and pkg-config.
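Once the server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch of a request, assuming the server exposes the standard /v1/chat/completions route on the port chosen above; the "model" value is illustrative and may be ignored or matched against the model selected at launch.

# Example chat completion request against the local server started above.
# Endpoint path and "model" value are assumptions based on typical
# OpenAI-compatible servers; adjust them to match your launch settings.
curl http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'

The same endpoint can also be reached from the official OpenAI SDKs by pointing their base URL at http://localhost:2000/v1.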
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is under active development, with some model types and sampling methods (e.g., beam search) marked as "TBD" or planned for future implementation. Multi-GPU threaded mode may require disabling peer-to-peer communication via export NCCL_P2P_DISABLE=1.