optillm by codelion

Optimizing inference proxy for LLMs

created 11 months ago
2,695 stars

Top 17.9% on sourcepulse

Project Summary

OptiLLM is an OpenAI API-compatible inference proxy designed to enhance LLM performance and accuracy, particularly for coding, logical, and mathematical tasks. It targets developers and researchers seeking to improve LLM reasoning capabilities through advanced inference-time techniques.

How It Works

OptiLLM implements state-of-the-art techniques like Mixture of Agents (MoA), Monte Carlo Tree Search (MCTS), and Chain-of-Thought (CoT) decoding. These methods augment LLM responses by performing additional computations at inference time, aiming to surpass frontier models on complex queries. The proxy supports various optimization approaches, selectable via model name prefixes, extra_body parameters, or prompt tags.
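
To make this concrete, the sketch below shows all three selection mechanisms against a locally running proxy. It assumes default settings; the approach slugs (moa, mcts, re2) and the optillm_approach field follow the project README, but exact names should be verified against the current docs.

    from openai import OpenAI

    # Point the stock OpenAI client at the optillm proxy.
    client = OpenAI(api_key="none", base_url="http://localhost:8000/v1")

    # 1) Pick an approach via a model-name prefix (Mixture of Agents here).
    resp = client.chat.completions.create(
        model="moa-gpt-4o-mini",
        messages=[{"role": "user", "content": "Solve: 24 * 17 - 9"}],
    )

    # 2) Pick an approach via extra_body instead of the model slug.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Solve: 24 * 17 - 9"}],
        extra_body={"optillm_approach": "mcts"},
    )

    # 3) Embed the approach as a tag inside the prompt itself.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "<optillm_approach>re2</optillm_approach> Solve: 24 * 17 - 9",
        }],
    )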

Quick Start & Requirements

  • Install: pip install optillm or use Docker (docker pull ghcr.io/codelion/optillm:latest).
  • Prerequisites: Python 3.x. Local inference with HuggingFace models requires an HF_TOKEN. Various LLM providers are supported via environment variables (e.g., OPENAI_API_KEY, GEMINI_API_KEY).
  • Usage: Run python optillm.py or docker run -p 8000:8000 ghcr.io/codelion/optillm:latest, then set base_url to http://localhost:8000/v1 in the OpenAI client (see the sketch after this list).
  • Docs: https://github.com/codelion/optillm
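
As a quick smoke test once the proxy is up, a minimal client call looks like the sketch below. The model name is illustrative, and the client-side api_key is assumed to be a placeholder when the proxy reads the real provider key from its own environment.

    from openai import OpenAI

    # The proxy is OpenAI API-compatible, so the stock client works
    # unchanged once base_url points at it.
    client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use any model your provider exposes
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(response.choices[0].message.content)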

Highlighted Details

  • Supports local inference with HuggingFace models and LoRAs.
  • Integrates with external tools via the Model Context Protocol (MCP) plugin.
  • Achieves state-of-the-art results on benchmarks like LongBench v2 and HELMET.
  • Offers plugins for memory, privacy, URL reading, code execution, and structured JSON output (plugin chaining is sketched below).
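
As an illustration of how plugins compose, the sketch below chains plugin slugs in the model name. The slug syntax (joining with &) and the readurls and memory plugin names follow the README, but are assumptions to check against the current docs.

    from openai import OpenAI

    client = OpenAI(api_key="none", base_url="http://localhost:8000/v1")

    # Chain plugins by joining their slugs ahead of the base model: here
    # readurls fetches linked pages and memory manages long context
    # before the request reaches the underlying model.
    response = client.chat.completions.create(
        model="readurls&memory-gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize https://github.com/codelion/optillm",
        }],
    )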

Maintenance & Community

The project is actively developed, with contributions from Asankhaya Sharma. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is available under an unspecified license. Its OpenAI API compatibility allows integration with existing tools and frameworks.

Limitations & Caveats

Some optimization techniques (e.g., cot_decoding, entropy_decoding) are not supported when using external servers like Anthropic API, llama.cpp, or Ollama due to their lack of multi-response sampling.
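
These decoding techniques do work with optillm's built-in local inference, since the proxy then controls sampling directly. Below is a hedged sketch, assuming the proxy is running with local inference enabled and that a decoding field in extra_body selects the technique (a parameter name taken from the README that should be verified; the model is illustrative).

    from openai import OpenAI

    client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

    # CoT decoding inspects token-level alternatives across several
    # candidate continuations, which is why it requires the proxy's own
    # local inference rather than a pass-through provider.
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative local HF model
        messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
        extra_body={"decoding": "cot_decoding"},
    )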

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 12
  • Issues (30d): 3
  • Star History: 525 stars in the last 90 days
