llama-box by gpustack

LLM inference server with OpenAI API compatibility

created 1 year ago
255 stars

Top 98.8% on SourcePulse

Project Summary

LLaMA Box is a C++-based inference server that exposes a pure API (no bundled web frontend) for a wide range of large language and multimodal models, aiming for OpenAI API compatibility. It targets developers and researchers who need a flexible backend for LLM applications, offering broad model support and advanced features such as speculative decoding and an RPC server mode.
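To illustrate the OpenAI-compatible surface, the sketch below issues a chat completion with the standard openai Python client pointed at a local llama-box instance. The host, port, and model name are assumptions for illustration, not values taken from the project docs.

    # Minimal sketch: chat completion against a llama-box server through its
    # OpenAI-compatible API. Assumes llama-box is already serving a model on
    # localhost:8080; host, port, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",  # assumed local llama-box endpoint
        api_key="not-needed",                 # local servers typically ignore the key
    )

    reply = client.chat.completions.create(
        model="qwen2.5-7b-instruct",  # placeholder; use whichever GGUF model you loaded
        messages=[{"role": "user", "content": "In one sentence, what is speculative decoding?"}],
    )
    print(reply.choices[0].message.content)

Because the API is OpenAI-compatible, existing OpenAI SDK code should only need its base_url swapped to point at the server.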

How It Works

LLaMA Box builds on the llama.cpp and stable-diffusion.cpp backends for high-performance inference. It supports a wide array of models, including LLaVA, MiniCPM, Qwen, LLaMA, Mistral, and more, with model-specific chat templates for compatibility. The server can split tensors across multiple GPUs and run in RPC server mode for distributed inference, improving resource utilization and scalability.

Quick Start & Requirements

  • Install: Download the latest release binary.
  • Prerequisites: one of NVIDIA CUDA (>=12.8.0), AMD ROCm/HIP (>=6.2.4), Intel oneAPI (2025.0.0), Vulkan (1.4.313), Huawei Ascend CANN (>=8.1.rc1), HYGON DTK (25.04), Moore Threads MUSA (rc4.2), Apple Metal 3, or a CPU with AVX2/NEON/AVX512 instruction sets. Specific driver versions are required for NVIDIA and AMD.
  • Setup: varies by backend and model; requires downloading GGUF model files (a download sketch follows this list).
  • Docs: https://github.com/gpustack/llama-box
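Since setup requires a GGUF model file, one common way to fetch one is with huggingface_hub, as sketched below; the repository and filename are hypothetical examples, not models the README specifically recommends.

    # Minimal sketch: download a GGUF model file for llama-box to serve.
    # repo_id and filename are hypothetical; substitute any GGUF build.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",     # hypothetical example repo
        filename="qwen2.5-7b-instruct-q4_k_m.gguf",  # hypothetical quantized file
    )
    print("Pass this path to llama-box:", model_path)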

Highlighted Details

  • Full OpenAI API compatibility (Chat, Vision, Embeddings, Images); an embeddings sketch follows this list.
  • Supports speculative decoding (draft and lookup models).
  • RPC server mode for distributed inference and tensor splitting.
  • Extensive hardware acceleration support including NVIDIA, AMD, Intel, Apple Metal, and CPU instruction sets.
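As a further illustration of endpoint coverage, the sketch below sends an embeddings request with plain requests; the path follows the OpenAI /v1/embeddings convention the project advertises compatibility with, while the endpoint URL and model name are placeholders.

    # Minimal sketch: embeddings via the OpenAI-compatible /v1/embeddings route.
    # Assumes a llama-box instance serving an embedding-capable model locally;
    # the URL and model name are placeholders.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/embeddings",  # assumed local endpoint
        json={
            "model": "nomic-embed-text",  # placeholder model name
            "input": "llama-box exposes OpenAI-style endpoints.",
        },
    )
    resp.raise_for_status()
    vector = resp.json()["data"][0]["embedding"]
    print(len(vector))  # dimensionality of the returned embedding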

Maintenance & Community

The project is actively maintained by gpustack. Community interaction channels are not explicitly listed in the README.

Licensing & Compatibility

The project is released under the MIT license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README marks support for some features and models as experimental. Performance can vary significantly with the chosen backend, hardware, and model configuration. Some advanced features, such as certain chat templates or speculative decoding, may require particular model-file conversions.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 6

Star History

  • 20 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Research Engineer at Mistral; Author of Hugging Face Diffusers), Junyang Lin (Core Maintainer of Alibaba Qwen), and 2 more.

ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation
0.3% · 15k stars · created 1 year ago · updated 2 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Shizhe Diao (Research Scientist at NVIDIA; Author of LMFlow), and 13 more.

TensorRT-LLM by NVIDIA

LLM inference optimization SDK for NVIDIA GPUs
0.5% · 11k stars · created 2 years ago · updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 41 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference
0.7% · 85k stars · created 2 years ago · updated 1 day ago