llama-box by gpustack

LLM inference server with OpenAI API compatibility

Created 1 year ago
281 stars

Top 92.6% on SourcePulse

Project Summary

LLaMA Box is a C++-based inference server that provides a pure API (no bundled frontend) for a wide range of large language and multimodal models, aiming for OpenAI API compatibility. It targets developers and researchers who need a flexible backend for LLM applications, offering broad model support and advanced features such as speculative decoding and an RPC server mode.
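
Because the API aims for OpenAI compatibility, a stock OpenAI client should be able to talk to it. Below is a minimal sketch using the official Python client; the address, API-key handling, and model name are placeholder assumptions for a locally running server, not values taken from the llama-box docs:

```python
# Minimal sketch: querying a running llama-box instance through the official
# OpenAI Python client. Host/port and model name are assumptions; point them
# at wherever your server is listening and whichever GGUF you loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local server address
    api_key="not-needed",                 # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",          # placeholder: name of the loaded model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```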

How It Works

LLaMA Box builds on the llama.cpp and stable-diffusion.cpp backends for high-performance inference. It supports a wide array of models (LLaVA, MiniCPM, Qwen, LLaMA, Mistral, and others), applying model-specific chat templates for compatibility. The server architecture allows features such as tensor splitting across multiple GPUs and an RPC server mode for distributed inference, improving resource utilization and scalability.
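
As a hedged illustration of the distributed setup, the sketch below starts one node as an RPC worker and a second as the API server that splits tensors between its local GPU and the remote worker. The flag names (--rpc-server-host, --rpc, --tensor-split) follow llama.cpp/llama-box conventions but are assumptions here; check `llama-box --help` for the authoritative options:

```python
# Hypothetical two-node setup: one box runs as an RPC worker, another runs
# the API server and offloads part of the model to it. All flag names are
# assumptions based on llama.cpp conventions.
import subprocess

# Node A: start llama-box in RPC server mode (assumed flags).
worker = subprocess.Popen(
    ["llama-box", "--rpc-server-host", "0.0.0.0", "--rpc-server-port", "50052"]
)

# Node B: serve the model, splitting tensors between the local GPU and the worker.
server = subprocess.Popen([
    "llama-box",
    "-m", "model.Q4_K_M.gguf",   # placeholder GGUF path
    "--rpc", "nodeA:50052",      # address of the remote RPC worker
    "--tensor-split", "1,1",     # even split across the two devices
    "--port", "8080",
])
# Both processes keep running; terminate them explicitly when done.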

Quick Start & Requirements

  • Install: Download the latest release binary.
  • Prerequisites: NVIDIA CUDA (>=12.8.0), AMD ROCm/HIP (>=6.2.4), Intel oneAPI (2025.0.0), Vulkan (1.4.313), Huawei Ascend CANN (>=8.1.rc1), HYGON DTK (25.04), Moore Threads MUSA (rc4.2), Apple Metal 3, or AVX2/NEON/AVX512 CPU instruction sets. Specific driver versions are required for NVIDIA and AMD.
  • Setup: Varies based on backend and model; requires downloading GGUF model files (see the launch sketch after this list).
  • Docs: https://github.com/gpustack/llama-box
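
A minimal launch sketch, assuming the release binary is on PATH and a GGUF model has already been downloaded; the flags mirror llama.cpp server conventions and are assumptions to verify against `llama-box --help`:

```python
# Hypothetical launch of the downloaded release binary. The model path and
# all flags below are placeholders, not confirmed llama-box options.
import subprocess

subprocess.run([
    "llama-box",
    "--host", "0.0.0.0",
    "--port", "8080",
    "-m", "qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder model path
    "-ngl", "99",                             # offload all layers to GPU (assumed flag)
])
```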

Highlighted Details

  • Full OpenAI API compatibility (Chat, Vision, Embeddings, Images); an embeddings client sketch follows this list.
  • Supports speculative decoding (draft and lookup models).
  • RPC server mode for distributed inference and tensor splitting.
  • Extensive hardware acceleration support including NVIDIA, AMD, Intel, Apple Metal, and CPU instruction sets.
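
For the embeddings endpoint specifically, here is a hedged client sketch; the model name and address are placeholders, and it assumes the server was started with an embedding-capable model:

```python
# Sketch of the OpenAI-compatible embeddings endpoint. Names and address are
# assumptions for a locally running llama-box instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

emb = client.embeddings.create(
    model="bge-m3",                       # placeholder embedding model name
    input=["llama-box speaks the OpenAI API"],
)
print(len(emb.data[0].embedding))         # dimensionality of the returned vector
```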

Maintenance & Community

The project is actively maintained by gpustack. Community interaction channels are not explicitly listed in the README.

Licensing & Compatibility

The project is released under the MIT license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The README marks support for some features and models as experimental. Performance can vary significantly with the chosen backend, hardware, and model configuration. Some advanced features, such as certain chat templates or speculative decoding, may require dedicated model files or conversions.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 10 stars in the last 30 days
