candle-vllm by EricLBuehler

Platform for local LLM inference and serving with OpenAI API compatibility

created 1 year ago
404 stars

Top 72.9% on sourcepulse

View on GitHub
Project Summary

This project provides an efficient, easy-to-use platform for serving local Large Language Models (LLMs) with an OpenAI-compatible API server. It targets developers and researchers needing to deploy and interact with LLMs locally, offering features like continuous batching, PagedAttention, and support for various quantization formats for optimized performance.

How It Works

candle-vllm leverages Rust and the Candle library for high-performance LLM inference. Its core design emphasizes efficiency through PagedAttention for key-value cache management and continuous batching to maximize GPU utilization. The platform's extensible trait-based system allows for rapid integration of new model architectures and processing pipelines, with built-in support for various quantization methods like GPTQ and Marlin for reduced memory footprint and faster inference.
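To make the PagedAttention idea concrete, here is a minimal conceptual sketch in Python. It illustrates the block-table bookkeeping only and is not candle-vllm's actual Rust implementation; the class names and block size are invented for the example.

    # Conceptual sketch of PagedAttention-style KV-cache management: the cache is
    # split into fixed-size blocks, and each sequence keeps a block table mapping
    # logical positions to physical blocks, so memory is claimed on demand instead
    # of being reserved up front for the maximum context length.
    BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

    class BlockAllocator:
        def __init__(self, num_blocks: int):
            self.free_blocks = list(range(num_blocks))

        def allocate(self) -> int:
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait or be preempted")
            return self.free_blocks.pop()

        def free(self, blocks: list[int]) -> None:
            self.free_blocks.extend(blocks)

    class Sequence:
        def __init__(self):
            self.num_tokens = 0
            self.block_table: list[int] = []  # logical block index -> physical block id

        def append_token(self, allocator: BlockAllocator) -> None:
            # A new physical block is allocated only when the current one fills up.
            if self.num_tokens % BLOCK_SIZE == 0:
                self.block_table.append(allocator.allocate())
            self.num_tokens += 1

    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence()
    for _ in range(40):  # decode 40 tokens
        seq.append_token(allocator)
    print(seq.block_table)  # 3 blocks cover 40 tokens; nothing was over-reserved
    allocator.free(seq.block_table)

Continuous batching builds on the same bookkeeping: new requests join and finished requests leave the running batch at each decoding step, rather than waiting for an entire batch to complete.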

Quick Start & Requirements

  • Install: install the Rust toolchain with curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh, then launch a model with cargo run --release --features cuda -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama (a client request sketch follows this list).
  • Prerequisites: Rust compiler (1.83.0+), CUDA Toolkit (if using GPU acceleration), libssl-dev, pkg-config.
  • Setup: Clone the repository and install the Rust toolchain.
  • Docs: Detailed Usage
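
Once the server is running, any OpenAI-compatible client can talk to it. Below is a minimal Python sketch using the openai package; it assumes the server started with the quick-start command above is listening on port 2000 and exposes the standard /v1 chat-completions route, and the model name and API key are placeholders to adjust for your setup.

    # Query the local candle-vllm server through its OpenAI-compatible API.
    # base_url and model are assumptions tied to the quick-start command above;
    # local servers typically do not validate the API key.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="llama",  # placeholder; use whatever model name your server expects
        messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)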

Highlighted Details

  • Supports multiple LLM architectures including Llama, Mistral, Phi, Qwen2, and Gemma.
  • Achieves high throughput with optimizations like PagedAttention and continuous batching.
  • Offers in-situ quantization for loading models into various GGML or Marlin formats on the fly.
  • Provides Mac/Metal device support in addition to CUDA.

Maintenance & Community

  • Actively developed by EricLBuehler.
  • Contributions are welcomed for features like beam search and more pipelines.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The project is under active development, with some model types and sampling methods (e.g., beam search) marked as "TBD" or planned for future implementation. Multi-GPU threaded mode may require disabling P2P communication via export NCCL_P2P_DISABLE=1.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 18
  • Issues (30d): 17
  • Star history: 49 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project
Top 1.0% · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago; updated 9 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
Top 0.4% · 84k stars
C/C++ library for local LLM inference
Created 2 years ago; updated 9 hours ago