llama-swap by mostlygeek

Proxy server for local model swapping with llama.cpp

created 10 months ago
1,088 stars

Top 35.6% on sourcepulse

Project Summary

This project provides a lightweight, transparent proxy server for managing multiple local LLM models, primarily targeting users of llama.cpp or any OpenAI-compatible inference server. It switches between models on demand, automatically loading and unloading them in response to incoming requests and configurable timeouts, which conserves resources and reduces the overhead of running several models on one machine.

How It Works

llama-swap acts as a central proxy that intercepts incoming OpenAI API requests. It inspects the model parameter in the request and dynamically starts or switches to the appropriate backend inference server (e.g., llama-server, vllm) configured for that model. This "swapping" mechanism ensures that only the necessary model is active, conserving resources. Advanced configurations allow running multiple models concurrently using "profiles," where each model is exposed on a unique address and port.
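
For illustration, here is a minimal sketch of a client request to the proxy. It assumes llama-swap is reachable at localhost:9292 (the host port used in the Docker quick start below) and that a model named "qwen2.5-0.5b" has been defined in config.yaml; both are placeholders rather than project defaults.

# Plain curl; any OpenAI-compatible client works the same way.
# The "model" field is what llama-swap inspects to decide which backend to start.
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-0.5b", "messages": [{"role": "user", "content": "Say hello."}]}'

If a different model is currently loaded, llama-swap stops it (unless profiles keep both resident) and starts the backend configured for "qwen2.5-0.5b" before forwarding the request.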

Quick Start & Requirements

  • Docker:
    • CPU: docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu (the CPU image ships with a small qwen2.5 0.5B model preconfigured for testing)
    • CUDA: docker run -it --rm --runtime nvidia -p 9292:8080 -v /path/to/models:/models -v /path/to/custom/config.yaml:/app/config.yaml ghcr.io/mostlygeek/llama-swap:cuda
  • Bare Metal: Download pre-built binaries for Linux, FreeBSD, Darwin. Run with llama-swap --config path/to/config.yaml.
  • Source: Requires Go. git clone ... && make clean all.
  • Dependencies: Any OpenAI-compatible inference server (e.g., llama.cpp, vllm). Docker images are available for CPU, CUDA, Intel, and Vulkan. ROCm support is pending.
  • Configuration: A single YAML file (config.yaml) defines models, their startup commands, proxy endpoints, aliases, TTLs, and profiles; a minimal sketch follows this list.
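
As referenced in the Configuration bullet above, the following is a minimal sketch of a config.yaml, written as a shell heredoc so it can be pasted into a terminal. The model name, paths, and port are placeholders, and the field names follow the project's published examples at the time of writing; check the repository's example config for the current schema.

cat > config.yaml <<'EOF'
# Each key under "models:" is the name clients send in the request's "model" field.
models:
  "qwen2.5-0.5b":
    # Command llama-swap runs to start this model's backend (placeholder path/port).
    cmd: llama-server --port 9001 -m /models/qwen2.5-0.5b-instruct-q4_k_m.gguf
    # Address llama-swap proxies requests to once the backend is up.
    proxy: http://127.0.0.1:9001
    # Optional alternate names clients may use for this model.
    aliases:
      - gpt-4o-mini
    # Unload the model after 300 seconds of inactivity.
    ttl: 300
# Profiles group models to run concurrently (syntax shown only as a hint; see the README):
# profiles:
#   coding:
#     - "qwen2.5-0.5b"
#     - "another-model"
EOF
llama-swap --config config.yaml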

Highlighted Details

  • Supports OpenAI API endpoints: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech, /v1/audio/transcriptions.
  • Custom API endpoints for remote log monitoring (/log), direct upstream access (/upstream/:model_id), manual unloading (/unload), and listing running models (/running); see the example after this list.
  • Automatic model unloading after a configurable Time-To-Live (TTL) to free up resources.
  • "Profiles" feature allows concurrent execution of multiple models, each on a distinct port.
  • Example use cases include speculative decoding for faster inference and optimizing code generation performance.
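
As an example of the custom endpoints above, the following requests list and unload models through the proxy. The host and port are the placeholders from the quick start, and the HTTP method for /unload is an assumption to verify against the README.

# List the models llama-swap currently has loaded:
curl http://localhost:9292/running

# Ask llama-swap to unload models and free their resources
# (method and parameters are assumptions; consult the README for the exact form):
curl -X POST http://localhost:9292/unload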

Maintenance & Community

  • Docker images are updated nightly.
  • Systemd unit file provided for Ubuntu (a generic sketch follows this list).
  • Links to example configurations and use cases are available in the README.
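
The project ships its own systemd unit for Ubuntu; purely for orientation, a generic unit of the same shape is sketched below. The binary and config paths are placeholders, not the project's defaults.

sudo tee /etc/systemd/system/llama-swap.service <<'EOF'
[Unit]
Description=llama-swap model-swapping proxy
After=network-online.target

[Service]
# Placeholder paths; point these at your actual binary and config.
ExecStart=/usr/local/bin/llama-swap --config /etc/llama-swap/config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap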

Licensing & Compatibility

  • The project appears to be licensed under the MIT License.
  • Compatible with any local OpenAI-compatible server.

Limitations & Caveats

  • ROCm support is disabled until fixed in the llama.cpp container.
  • Audio endpoints and other newer features depend on the upstream inference server's version and build configuration; check that your llama.cpp (or other backend) build supports them.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 13
  • Issues (30d): 17

Star History

  • 434 stars in the last 90 days
