Proxy server for local model swapping with llama.cpp
This project provides a lightweight, transparent proxy server for managing multiple local LLM models, primarily targeting users of llama.cpp or any other OpenAI-compatible inference server. It switches between models on demand, automatically loading and unloading them based on request patterns or timeouts, which conserves resources and keeps multi-model deployments simple.
How It Works
llama-swap acts as a central proxy that intercepts incoming OpenAI API requests. It inspects the model parameter in each request and dynamically starts or switches to the appropriate backend inference server (e.g., llama-server, vllm) configured for that model. This "swapping" mechanism ensures that only the necessary model is active, conserving resources. Advanced configurations allow running multiple models concurrently using "profiles", where each model is exposed on a unique address and port.
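As a sketch of the request flow, two consecutive requests that name different models cause the proxy to stop one backend and start the other. The model names below (qwen2.5-0.5b and gemma) and the host port 9292 are placeholders that assume matching entries in config.yaml and the Docker port mapping from the Quick Start:

curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-0.5b", "messages": [{"role": "user", "content": "Hello"}]}'

curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "Hello"}]}'

From the client's perspective both calls are ordinary OpenAI API requests; the swap happens entirely inside the proxy.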
Quick Start & Requirements
Run the CPU Docker image (it bundles a small qwen2.5 0.5B model for testing):

docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu

Run the CUDA image with your own models and configuration mounted in:

docker run -it --rm --runtime nvidia -p 9292:8080 -v /path/to/models:/models -v /path/to/custom/config.yaml:/app/config.yaml ghcr.io/mostlygeek/llama-swap:cuda

Or run the binary directly with a configuration file:

llama-swap --config path/to/config.yaml

To build from source:

git clone ... && make clean all
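With the CPU container from the first command running, a quick test against the bundled model might look like this (the model id is assumed to be qwen2.5 here; check the container output or configuration for the exact name):

curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5", "messages": [{"role": "user", "content": "Say hello."}]}'

Since models are loaded on demand, the first request takes longer while the backend starts.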
Requirements: a backend inference server such as llama.cpp (llama-server) or vllm. Docker images are available for CPU, CUDA, Intel, and Vulkan; ROCm support is pending.

A single configuration file (config.yaml) defines models, their startup commands, proxy endpoints, aliases, TTLs, and profiles.
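As an illustration, a minimal configuration might look like the sketch below. The key names follow the fields listed above (startup command, proxy endpoint, aliases, TTL, profiles), but the paths, ports, and model ids are made up; the authoritative schema is in the project's documentation:

models:
  "qwen2.5-0.5b":
    cmd: llama-server --port 9001 -m /models/qwen2.5-0.5b-instruct-q4_k_m.gguf
    proxy: http://127.0.0.1:9001
    aliases:
      - qwen-small
    ttl: 300   # seconds of inactivity before the model is unloaded

profiles:
  demo:
    - qwen2.5-0.5b

A request naming qwen2.5-0.5b (or its alias qwen-small) starts the llama-server command above; after 300 idle seconds the process is stopped to free memory.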
Highlighted Details

Supports the OpenAI-compatible endpoints /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech, and /v1/audio/transcriptions.

Additional endpoints provide log monitoring (/log), direct upstream access (/upstream/:model_id), manual unloading (/unload), and listing running models (/running).
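As a quick illustration of the management endpoints (assuming the proxy is reachable on port 9292 as in the Quick Start; HTTP methods and response formats should be confirmed against the project documentation):

curl http://localhost:9292/running
curl http://localhost:9292/unload

The first call reports which models are currently loaded; the second asks the proxy to unload the running model(s) without waiting for a TTL to expire.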
Maintenance & Community

Licensing & Compatibility
Limitations & Caveats
llama.cpp features might require specific versions or configurations.