paddler by intentee

Load balancer for llama.cpp servers

Created 1 year ago

1,414 stars

Top 28.5% on SourcePulse


1 Expert Loves This Project

Georgi Gerganov

Author of llama.cpp, whisper.cpp

Project Summary

Paddler is a stateful load balancer and reverse proxy designed specifically for llama.cpp servers. Traditional load-balancing strategies such as round-robin do not account for how llama.cpp handles concurrent requests, so Paddler distributes requests according to llama.cpp's slot-based concurrency model. It targets users running multiple llama.cpp instances who want better resource utilization and easier scaling.

How It Works

Paddler uses a distributed, agent-based architecture. An agent runs alongside each llama.cpp instance, monitors its available slots (the units in which llama.cpp processes concurrent requests), and reports that state to the central Paddler balancer. The balancer uses the reported slot counts to route each incoming request to a server with free capacity, so that every llama.cpp server is kept busy without being overloaded. Tracking this state matters because llama.cpp serves concurrent requests through continuous batching across its slots, and a stateless balancer has no visibility into which slots are free.
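The slot-aware selection described above can be illustrated with a minimal Python sketch. It assumes each agent reports an idle and a processing slot count; the AgentStatus type, its fields, and the tie-breaking rule are hypothetical and are not taken from Paddler's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStatus:
    """Slot counts last reported by one agent (hypothetical structure)."""
    name: str
    slots_idle: int
    slots_processing: int


def pick_upstream(agents: list[AgentStatus]) -> Optional[AgentStatus]:
    """Return the agent with the most idle slots, or None if every slot is busy."""
    candidates = [a for a in agents if a.slots_idle > 0]
    if not candidates:
        return None  # no free slot anywhere; the caller must wait or reject
    # Prefer the most idle slots; break ties by fewest requests in flight.
    return max(candidates, key=lambda a: (a.slots_idle, -a.slots_processing))


def reserve_slot(agent: AgentStatus) -> None:
    """Optimistically update local state so back-to-back picks don't double-book a slot."""
    agent.slots_idle -= 1
    agent.slots_processing += 1


agents = [
    AgentStatus("llama-1", slots_idle=1, slots_processing=3),
    AgentStatus("llama-2", slots_idle=3, slots_processing=1),
    AgentStatus("llama-3", slots_idle=0, slots_processing=4),
]
chosen = pick_upstream(agents)
reserve_slot(chosen)
print(chosen.name, chosen.slots_idle)  # llama-2 2
```

In this sketch, decrementing the idle count at selection time keeps several requests that arrive between two agent reports from all being sent to the same last free slot; the agent's next report then replaces the estimate.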

Quick Start & Requirements

Install by downloading pre-compiled binaries for Linux, macOS, or Windows from the releases page.
Requires llama.cpp servers to be running with the --slots flag enabled.
Agents require --external-llamacpp-addr, --local-llamacpp-addr, and --management-addr flags.
The balancer requires --management-addr and --reverseproxy-addr flags.
Setup consists of starting the balancer, then starting one agent per llama.cpp instance, each pointed at its instance's slots endpoint and at the balancer's management address.
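The flag wiring above can be made concrete with a small Python helper that only builds command lines. It assumes a single binary named paddler with agent and balancer subcommands and --flag=value syntax, and it assumes --local-llamacpp-addr is where the agent reaches llama.cpp while --external-llamacpp-addr is the address the balancer forwards traffic to; verify all of this against the release you download. The addresses and ports are placeholders.

```python
import shlex

# Hypothetical: binary name, subcommand names, and --flag=value syntax are assumptions.
PADDLER = "paddler"


def balancer_command(management_addr: str, reverseproxy_addr: str) -> list[str]:
    """Balancer: agents report to management_addr; clients send requests to reverseproxy_addr."""
    return [
        PADDLER, "balancer",
        f"--management-addr={management_addr}",
        f"--reverseproxy-addr={reverseproxy_addr}",
    ]


def agent_command(local_llamacpp_addr: str, external_llamacpp_addr: str,
                  management_addr: str) -> list[str]:
    """Agent: polls llama.cpp locally and advertises the externally reachable address."""
    return [
        PADDLER, "agent",
        f"--local-llamacpp-addr={local_llamacpp_addr}",
        f"--external-llamacpp-addr={external_llamacpp_addr}",
        # Must match the balancer's --management-addr so the agent's reports reach it.
        f"--management-addr={management_addr}",
    ]


MANAGEMENT = "10.0.0.1:8085"  # placeholder balancer management address
balancer = balancer_command(MANAGEMENT, "0.0.0.0:8080")
agents = [
    agent_command("127.0.0.1:8088", f"10.0.0.{host}:8088", MANAGEMENT)
    for host in (2, 3)  # one agent per llama.cpp host
]

for cmd in [balancer, *agents]:
    print(shlex.join(cmd))
```

In a real deployment each agent runs on its own llama.cpp host, so these would be separate commands on different machines; the point of the sketch is that every agent shares the balancer's management address.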

Highlighted Details

State-aware load balancing based on llama.cpp slots.
Dynamic addition/removal of llama.cpp instances for autoscaling.
Request buffering to support scaling from zero hosts.
Built-in dashboard and StatsD metrics for monitoring.
AWS integration capabilities.
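Request buffering is what makes scaling from zero hosts workable: a request that arrives while no llama.cpp instance is registered waits for one to appear instead of failing immediately. The asyncio sketch below illustrates that idea together with dynamic addition and removal of hosts; RequestBuffer, its register/remove/acquire methods, and the timeout are hypothetical names chosen for illustration, not Paddler's API.

```python
import asyncio
from typing import Optional


class RequestBuffer:
    """Parks requests until an upstream exists (hypothetical sketch, not Paddler's API)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.upstreams: list[str] = []
        self._has_upstream = asyncio.Event()

    def register(self, upstream: str) -> None:
        # Called when a new llama.cpp instance comes up (e.g. after autoscaling).
        self.upstreams.append(upstream)
        self._has_upstream.set()

    def remove(self, upstream: str) -> None:
        # Called when an instance is scaled down or stops reporting.
        self.upstreams.remove(upstream)
        if not self.upstreams:
            self._has_upstream.clear()

    async def acquire(self) -> Optional[str]:
        """Return an upstream address, waiting up to timeout_s; None means give up."""
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.timeout_s
        # Re-check in a loop: an upstream may be removed between wake-up and use.
        while not self.upstreams:
            remaining = deadline - loop.time()
            if remaining <= 0:
                return None
            try:
                await asyncio.wait_for(self._has_upstream.wait(), remaining)
            except asyncio.TimeoutError:
                return None
        return self.upstreams[0]  # simplest choice; slot-aware selection would go here


async def demo() -> tuple[Optional[str], Optional[str]]:
    buf = RequestBuffer(timeout_s=0.1)
    first = await buf.acquire()  # zero hosts: waits 0.1 s, then gives up
    waiter = asyncio.create_task(buf.acquire())  # another request arrives with zero hosts
    await asyncio.sleep(0.01)
    buf.register("10.0.0.2:8088")  # a host scales up while the request is buffered
    second = await waiter
    return first, second


first, second = asyncio.run(demo())
print(first, second)  # None 10.0.0.2:8088
```

The timeout only bounds how long each request waits; a production buffer would presumably also cap how many requests it holds at once.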

Maintenance & Community

Actively maintained with recent updates including a TUI dashboard.
Discord community available at https://discord.gg/kysUzFqSCK.
Requires llama.cpp version b4027 or above.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

The project was recently rewritten in Rust on top of the Pingora framework; v1.0.0 marks the point from which the API is intended to stay stable until v2.0.0.
Exposing the /slots endpoint requires explicitly enabling it with the --slots-endpoint-enable flag, because the endpoint can disclose sensitive information.
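To illustrate the kind of state an agent works with, the following Python sketch counts idle and busy slots from a /slots response body. It assumes the response is a JSON array of slot objects and that each object carries either an is_processing boolean or a numeric state field (0 meaning idle); these field names vary across llama.cpp versions and are assumptions here, so check the output of the server you actually run.

```python
import json


def count_slots(slots_json: str) -> tuple[int, int]:
    """Return (idle, processing) from a llama.cpp /slots response body.

    Assumption: the body is a JSON array of slot objects, each with either a
    boolean "is_processing" field or an integer "state" field (0 = idle).
    """
    slots = json.loads(slots_json)
    processing = 0
    for slot in slots:
        if "is_processing" in slot:
            busy = bool(slot["is_processing"])
        else:
            busy = slot.get("state", 0) != 0  # treat any non-zero state as busy
        processing += busy
    return len(slots) - processing, processing


sample = '[{"id": 0, "is_processing": true}, {"id": 1, "is_processing": false}, {"id": 2, "state": 0}]'
print(count_slots(sample))  # (2, 1)
```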

Health Check

Last Commit

4 days ago

Responsiveness

1 day

Star History

28 stars in the last 30 days