llama-box by gpustack

LLM inference server with OpenAI API compatibility

Created 1 year ago
281 stars

Top 92.6% on SourcePulse

Project Summary

LLaMA Box is a C++-based inference server that provides a pure API (no bundled frontend) for a wide range of large language and multimodal models, aiming for OpenAI API compatibility. It targets developers and researchers who need a flexible backend for LLM applications, offering broad model support and advanced features such as speculative decoding and an RPC server mode.
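
Because the API aims for OpenAI compatibility, a stock OpenAI client should be able to talk to it. Below is a minimal sketch using the official Python client; the address, API-key handling, and model name are placeholder assumptions for a locally running server, not values taken from the llama-box docs:

```python
# Minimal sketch: querying a running llama-box instance through the official
# OpenAI Python client. Host/port and model name are assumptions; point them
# at wherever your server is listening and whichever GGUF you loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local server address
    api_key="not-needed",                 # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",          # placeholder: name of the loaded model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```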

How It Works

LLaMA Box builds on the llama.cpp and stable-diffusion.cpp backends for high-performance inference. It supports a wide array of models (LLaVA, MiniCPM, Qwen, LLaMA, Mistral, and others), applying model-specific chat templates for compatibility. The server architecture allows features such as tensor splitting across multiple GPUs and an RPC server mode for distributed inference, improving resource utilization and scalability.
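
As a hedged illustration of the distributed setup, the sketch below starts one node as an RPC worker and a second as the API server that splits tensors between its local GPU and the remote worker. The flag names (--rpc-server-host, --rpc, --tensor-split) follow llama.cpp/llama-box conventions but are assumptions here; check `llama-box --help` for the authoritative options:

```python
# Hypothetical two-node setup: one box runs as an RPC worker, another runs
# the API server and offloads part of the model to it. All flag names are
# assumptions based on llama.cpp conventions.
import subprocess

# Node A: start llama-box in RPC server mode (assumed flags).
worker = subprocess.Popen(
    ["llama-box", "--rpc-server-host", "0.0.0.0", "--rpc-server-port", "50052"]
)

# Node B: serve the model, splitting tensors between the local GPU and the worker.
server = subprocess.Popen([
    "llama-box",
    "-m", "model.Q4_K_M.gguf",   # placeholder GGUF path
    "--rpc", "nodeA:50052",      # address of the remote RPC worker
    "--tensor-split", "1,1",     # even split across the two devices
    "--port", "8080",
])
# Both processes keep running; terminate them explicitly when done.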

Quick Start & Requirements

  • Install: Download the latest release binary.
  • Prerequisites: NVIDIA CUDA (>=12.8.0), AMD ROCm/HIP (>=6.2.4), Intel oneAPI (2025.0.0), Vulkan (1.4.313), Huawei Ascend CANN (>=8.1.rc1), HYGON DTK (25.04), Moore Threads MUSA (rc4.2), Apple Metal 3, or AVX2/NEON/AVX512 CPU instruction sets. Specific driver versions are required for NVIDIA and AMD.
  • Setup: Varies based on backend and model; requires downloading GGUF model files (see the launch sketch after this list).
  • Docs: https://github.com/gpustack/llama-box
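
A minimal launch sketch, assuming the release binary is on PATH and a GGUF model has already been downloaded; the flags mirror llama.cpp server conventions and are assumptions to verify against `llama-box --help`:

```python
# Hypothetical launch of the downloaded release binary. The model path and
# all flags below are placeholders, not confirmed llama-box options.
import subprocess

subprocess.run([
    "llama-box",
    "--host", "0.0.0.0",
    "--port", "8080",
    "-m", "qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder model path
    "-ngl", "99",                             # offload all layers to GPU (assumed flag)
])
```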

Highlighted Details

  • Full OpenAI API compatibility (Chat, Vision, Embeddings, Images); an embeddings client sketch follows this list.
  • Supports speculative decoding (draft and lookup models).
  • RPC server mode for distributed inference and tensor splitting.
  • Extensive hardware acceleration support including NVIDIA, AMD, Intel, Apple Metal, and CPU instruction sets.
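
For the embeddings endpoint specifically, here is a hedged client sketch; the model name and address are placeholders, and it assumes the server was started with an embedding-capable model:

```python
# Sketch of the OpenAI-compatible embeddings endpoint. Names and address are
# assumptions for a locally running llama-box instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

emb = client.embeddings.create(
    model="bge-m3",                       # placeholder embedding model name
    input=["llama-box speaks the OpenAI API"],
)
print(len(emb.data[0].embedding))         # dimensionality of the returned vector
```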

Maintenance & Community

The project is actively maintained by gpustack. Community interaction channels are not explicitly listed in the README.

Licensing & Compatibility

The project is released under the MIT license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README marks support for some features and models as experimental. Performance can vary significantly with the chosen backend, hardware, and model configuration. Some advanced features, such as certain chat templates or speculative decoding, may require dedicated model files or conversions.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 10 stars in the last 30 days
