llama-swap by mostlygeek

Proxy server for local model swapping with llama.cpp

created 10 months ago
1,088 stars

Top 35.6% on sourcepulse

Project Summary

This project provides a lightweight, transparent proxy server for managing multiple local LLM models, primarily targeting users of llama.cpp or any OpenAI-compatible inference server. It switches between models on demand, automatically loading and unloading them in response to incoming requests and configurable timeouts, which conserves resources and reduces the overhead of running several models on one machine.

How It Works

llama-swap acts as a central proxy that intercepts incoming OpenAI API requests. It inspects the model parameter in the request and dynamically starts or switches to the appropriate backend inference server (e.g., llama-server, vllm) configured for that model. This "swapping" mechanism ensures that only the necessary model is active, conserving resources. Advanced configurations allow running multiple models concurrently using "profiles," where each model is exposed on a unique address and port.
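
For illustration, here is a minimal sketch of a client request to the proxy. It assumes llama-swap is reachable at localhost:9292 (the host port used in the Docker quick start below) and that a model named "qwen2.5-0.5b" has been defined in config.yaml; both are placeholders rather than project defaults.

# Plain curl; any OpenAI-compatible client works the same way.
# The "model" field is what llama-swap inspects to decide which backend to start.
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-0.5b", "messages": [{"role": "user", "content": "Say hello."}]}'

If a different model is currently loaded, llama-swap stops it (unless profiles keep both resident) and starts the backend configured for "qwen2.5-0.5b" before forwarding the request.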

Quick Start & Requirements

  • Docker:
    • CPU: docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu (the CPU image ships with a small qwen2.5 0.5B model preconfigured for testing)
    • CUDA: docker run -it --rm --runtime nvidia -p 9292:8080 -v /path/to/models:/models -v /path/to/custom/config.yaml:/app/config.yaml ghcr.io/mostlygeek/llama-swap:cuda
  • Bare Metal: Download pre-built binaries for Linux, FreeBSD, Darwin. Run with llama-swap --config path/to/config.yaml.
  • Source: Requires Go. git clone ... && make clean all.
  • Dependencies: Any OpenAI-compatible inference server (e.g., llama.cpp, vllm). Docker images are available for CPU, CUDA, Intel, and Vulkan. ROCm support is pending.
  • Configuration: A single YAML file (config.yaml) defines models, their startup commands, proxy endpoints, aliases, TTLs, and profiles; a minimal sketch follows this list.
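
As referenced in the Configuration bullet above, the following is a minimal sketch of a config.yaml, written as a shell heredoc so it can be pasted into a terminal. The model name, paths, and port are placeholders, and the field names follow the project's published examples at the time of writing; check the repository's example config for the current schema.

cat > config.yaml <<'EOF'
# Each key under "models:" is the name clients send in the request's "model" field.
models:
  "qwen2.5-0.5b":
    # Command llama-swap runs to start this model's backend (placeholder path/port).
    cmd: llama-server --port 9001 -m /models/qwen2.5-0.5b-instruct-q4_k_m.gguf
    # Address llama-swap proxies requests to once the backend is up.
    proxy: http://127.0.0.1:9001
    # Optional alternate names clients may use for this model.
    aliases:
      - gpt-4o-mini
    # Unload the model after 300 seconds of inactivity.
    ttl: 300
# Profiles group models to run concurrently (syntax shown only as a hint; see the README):
# profiles:
#   coding:
#     - "qwen2.5-0.5b"
#     - "another-model"
EOF
llama-swap --config config.yaml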

Highlighted Details

  • Supports OpenAI API endpoints: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech, /v1/audio/transcriptions.
  • Custom API endpoints for remote log monitoring (/log), direct upstream access (/upstream/:model_id), manual unloading (/unload), and listing running models (/running); see the example after this list.
  • Automatic model unloading after a configurable Time-To-Live (TTL) to free up resources.
  • "Profiles" feature allows concurrent execution of multiple models, each on a distinct port.
  • Example use cases include speculative decoding for faster inference and optimizing code generation performance.
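
As an example of the custom endpoints above, the following requests list and unload models through the proxy. The host and port are the placeholders from the quick start, and the HTTP method for /unload is an assumption to verify against the README.

# List the models llama-swap currently has loaded:
curl http://localhost:9292/running

# Ask llama-swap to unload models and free their resources
# (method and parameters are assumptions; consult the README for the exact form):
curl -X POST http://localhost:9292/unload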

Maintenance & Community

  • Docker images are updated nightly.
  • Systemd unit file provided for Ubuntu (a generic sketch follows this list).
  • Links to example configurations and use cases are available in the README.
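
The project ships its own systemd unit for Ubuntu; purely for orientation, a generic unit of the same shape is sketched below. The binary and config paths are placeholders, not the project's defaults.

sudo tee /etc/systemd/system/llama-swap.service <<'EOF'
[Unit]
Description=llama-swap model-swapping proxy
After=network-online.target

[Service]
# Placeholder paths; point these at your actual binary and config.
ExecStart=/usr/local/bin/llama-swap --config /etc/llama-swap/config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap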

Licensing & Compatibility

  • The project appears to be licensed under the MIT License.
  • Compatible with any local OpenAI-compatible server.

Limitations & Caveats

  • ROCm support is disabled until fixed in the llama.cpp container.
  • Audio endpoints and other newer features depend on the upstream inference server's version and build configuration; check that your llama.cpp (or other backend) build supports them.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 13
  • Issues (30d): 17

Star History

  • 434 stars in the last 90 days
