olla by thushan

High-performance proxy and load balancer for LLM infrastructure

Created 1 year ago

252 stars

Top 99.5% on SourcePulse

View on GitHub

1 Expert Loves This Project

Joe Walnes

Head of Experimental Projects at Stripe

Project Summary

Summary

Olla is a high-performance, low-overhead proxy and load balancer for LLM infrastructure. It intelligently routes requests across diverse inference backends, offering automatic failover, unified model discovery, and sticky sessions to enhance reliability and efficiency. This tool targets engineers and researchers managing LLM deployments, providing a unified interface to various inference engines.

How It Works

Olla acts as an intelligent intermediary, directing LLM requests to suitable inference backends. It features two proxy engines: Sherpa (simple) and Olla (advanced, with circuit breakers/connection pooling). Key functions include unifying model discovery across providers and KV-cache-aware affinity routing for sticky sessions. Automatic failover, retries, and continuous health monitoring ensure high availability.

Quick Start & Requirements

Installation options include a bash script (curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh | bash), Docker (docker run -p 40114:40114 ghcr.io/thushan/olla:latest), Go (go install github.com/thushan/olla@latest), or building from source. No specific non-default hardware or software prerequisites are detailed beyond standard OS and Docker support. Full documentation is available at https://thushan.github.io/olla/.

Highlighted Details

Smart Load Balancing: Priority-based routing with automatic failover and retries.
Sticky Sessions: KV-cache-aware affinity routing for multi-turn conversations.
Model Unification: Per-provider unification and cross-provider routing.
Dual Proxy Engines: Sherpa (simple) and Olla (high-performance).
Health Monitoring: Continuous endpoint health checks with circuit breakers.
Performance: Sub-millisecond endpoint selection, low memory footprint (<50MB RAM), streaming-first design.
Supported Backends: Ollama, LM Studio, vLLM, llama.cpp, SGLang, LMDeploy, and OpenAI-compatible endpoints.

Maintenance & Community

Developed by TensorFoundry. Key links include GitHub issues (https://github.com/thushan/olla/issues) and releases (https://github.com/thushan/olla/releases). No specific community chat channels are mentioned.

Licensing & Compatibility

Licensed under the Apache License 2.0, permissive for commercial use. Supports Linux, macOS, Windows, and Docker across AMD64 and ARM64 architectures.

Limitations & Caveats

The Anthropic Messages API translation is noted as "still actively being improved." Users may face limitations with highly custom or unsupported inference engines, potentially requiring manual integration efforts.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

24 stars in the last 30 days