Load balancer for llama.cpp servers
Paddler is a stateful load balancer and reverse proxy designed specifically for llama.cpp servers. It addresses the limitations of traditional load balancing strategies when applied to AI workloads. It targets users running llama.cpp who need request distribution that is aware of llama.cpp's slot-based concurrency model, enabling better resource utilization and scalability.
How It Works
Paddler employs a distributed, agent-based architecture. An agent runs alongside each llama.cpp instance, monitors its available "slots" (concurrent request processing units), and reports this state to the central Paddler balancer. The balancer then uses this slot-aware state to distribute incoming requests so that each llama.cpp server's capacity is used as fully as possible. This stateful approach suits llama.cpp's continuous batching, which stateless balancing methods do not take into account.
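The slot-aware selection described above can be sketched roughly as follows. This is a minimal illustration, not Paddler's actual implementation: the names AgentStatus, SlotAwareBalancer, report, and pick are hypothetical, and the "most idle slots wins" policy is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AgentStatus:
    """Last state reported by one agent (hypothetical shape)."""
    name: str
    slots_idle: int
    slots_processing: int


class SlotAwareBalancer:
    """Keeps the latest report per agent and picks a target with idle slots."""

    def __init__(self) -> None:
        self._agents: Dict[str, AgentStatus] = {}

    def report(self, status: AgentStatus) -> None:
        # Agents push their state; the newest report replaces the old one.
        self._agents[status.name] = status

    def pick(self) -> Optional[str]:
        # Only instances with at least one idle slot are candidates.
        candidates = [a for a in self._agents.values() if a.slots_idle > 0]
        if not candidates:
            return None  # the caller may queue or reject the request
        # Assumed policy: prefer the instance with the most idle slots.
        best = max(candidates, key=lambda a: a.slots_idle)
        # Reserve the slot locally until the agent's next report arrives.
        best.slots_idle -= 1
        best.slots_processing += 1
        return best.name


balancer = SlotAwareBalancer()
balancer.report(AgentStatus("node-a", slots_idle=1, slots_processing=3))
balancer.report(AgentStatus("node-b", slots_idle=2, slots_processing=2))
first_target = balancer.pick()
```

A stateless round-robin balancer would alternate between the two nodes regardless of load. Here the first request goes to the instance that reported more idle slots, and a request is held back once no instance has a free slot.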
Quick Start & Requirements
llama.cpp servers to be running with the --slots flag enabled.
--external-llamacpp-addr, --local-llamacpp-addr, and --management-addr flags.
--management-addr and --reverseproxy-addr flags.
llama.cpp's slot endpoint.
Highlighted Details
llama.cpp slots.
llama.cpp instances for autoscaling.
Maintenance & Community
llama.cpp version b4027 or above.
Licensing & Compatibility
Limitations & Caveats
/slots endpoint requires explicit enablement via the --slots-endpoint-enable flag because it can disclose sensitive information.
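As an illustration of what an agent might derive from the /slots endpoint, the sketch below counts idle and busy slots in a JSON array of slot objects. The function count_slots is hypothetical, the payload is a made-up sample rather than real server output, and the per-slot field name is_processing is an assumption that should be checked against the llama.cpp server documentation for the version in use.

```python
import json
from typing import Tuple


def count_slots(slots_json: str, busy_key: str = "is_processing") -> Tuple[int, int]:
    """Return (idle, processing) counts from a JSON array of slot objects.

    busy_key is an assumed field name; pass a different key if the server
    reports slot state under another name.
    """
    slots = json.loads(slots_json)
    # A slot counts as busy when the assumed flag is present and truthy.
    processing = sum(1 for slot in slots if slot.get(busy_key))
    return len(slots) - processing, processing


# Hypothetical payload, not captured from a real server.
sample = '[{"id": 0, "is_processing": true}, {"id": 1, "is_processing": false}]'
idle, processing = count_slots(sample)
```

Keeping the parsing separate from the HTTP request makes it easy to test and to adapt if the field names differ between server versions.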