Optimizing inference proxy for LLMs
Top 17.9% on sourcepulse
OptiLLM is an OpenAI API-compatible inference proxy designed to enhance LLM performance and accuracy, particularly for coding, logical, and mathematical tasks. It targets developers and researchers seeking to improve LLM reasoning capabilities through advanced inference-time techniques.
How It Works
OptiLLM implements state-of-the-art techniques like Mixture of Agents (MoA), Monte Carlo Tree Search (MCTS), and Chain-of-Thought (CoT) decoding. These methods augment LLM responses by performing additional computations at inference time, aiming to surpass frontier models on complex queries. The proxy supports various optimization approaches, selectable via model name prefixes, extra_body parameters, or prompt tags.
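As a rough illustration, assuming the proxy is already running on localhost:8000 and forwarding to an upstream model such as gpt-4o-mini, approach selection might look like the sketch below. The moa- prefix and the optillm_approach field reflect the prefix and extra_body conventions described above; treat the exact names as assumptions rather than a definitive reference.

```python
from openai import OpenAI

# Point the standard OpenAI client at the OptiLLM proxy instead of api.openai.com.
client = OpenAI(api_key="sk-...", base_url="http://localhost:8000/v1")

# Option 1: select an approach with a model-name prefix (here, Mixture of Agents).
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # "moa-" prefix assumed from the model-name-prefix convention
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

# Option 2: select an approach via extra_body instead of the model name.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    extra_body={"optillm_approach": "mcts"},  # field name assumed from the extra_body mechanism
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

print(response.choices[0].message.content)
```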
Quick Start & Requirements
- Install with pip install optillm, or use Docker (docker pull ghcr.io/codelion/optillm:latest).
- Set HF_TOKEN when serving local Hugging Face models. Other LLM providers are supported via environment variables (e.g., OPENAI_API_KEY, GEMINI_API_KEY).
- Run with python optillm.py, or docker run -p 8000:8000 ghcr.io/codelion/optillm:latest.
- In the OpenAI client, set base_url to http://localhost:8000/v1 (see the sketch after this list).
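A minimal end-to-end check, assuming the proxy is running locally on port 8000 and the upstream provider key (for example OPENAI_API_KEY) is already set in the proxy's environment:

```python
from openai import OpenAI

# The proxy speaks the OpenAI API, so the regular client works unchanged; only base_url changes.
# The api_key here is a placeholder; depending on configuration, the proxy may read the
# real provider key from its own environment instead.
client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model name the configured upstream provider accepts
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```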
Highlighted Details
Maintenance & Community
The project is actively developed, with contributions from Asankhaya Sharma. Community channels are not explicitly mentioned in the README.
Licensing & Compatibility
The project is available under an unspecified license. Its OpenAI API compatibility allows integration with existing tools and frameworks.
Limitations & Caveats
Some optimization techniques (e.g., cot_decoding, entropy_decoding) are not supported when using external servers such as the Anthropic API, llama.cpp, or Ollama, because those servers lack multi-response sampling.
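If you are unsure whether a given OpenAI-compatible endpoint can back these techniques, one rough check is whether it honors the standard n parameter for multiple completions, which sampling-based approaches rely on. This is a sketch against a hypothetical endpoint, not part of OptiLLM itself:

```python
from openai import OpenAI

# Hypothetical endpoint URL; substitute the server you want to test.
client = OpenAI(api_key="sk-...", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Pick a random color."}],
    n=2,  # request two candidate completions in a single call
)

# Servers without multi-response sampling typically return a single choice or an error.
print(f"choices returned: {len(response.choices)}")
```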