Optimizing inference proxy for LLMs
Top 17.9% on sourcepulse
OptiLLM is an OpenAI API-compatible inference proxy designed to enhance LLM performance and accuracy, particularly for coding, logical, and mathematical tasks. It targets developers and researchers seeking to improve LLM reasoning capabilities through advanced inference-time techniques.
How It Works
OptiLLM implements state-of-the-art techniques like Mixture of Agents (MoA), Monte Carlo Tree Search (MCTS), and Chain-of-Thought (CoT) decoding. These methods augment LLM responses by performing additional computations at inference time, aiming to surpass frontier models on complex queries. The proxy supports various optimization approaches, selectable via model name prefixes, extra_body parameters, or prompt tags.
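As a rough illustration, assuming the proxy is already running on localhost:8000 and forwarding to an upstream model such as gpt-4o-mini, approach selection might look like the sketch below. The moa- prefix and the optillm_approach field reflect the prefix and extra_body conventions described above; treat the exact names as assumptions rather than a definitive reference.

```python
from openai import OpenAI

# Point the standard OpenAI client at the OptiLLM proxy instead of api.openai.com.
client = OpenAI(api_key="sk-...", base_url="http://localhost:8000/v1")

# Option 1: select an approach with a model-name prefix (here, Mixture of Agents).
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # "moa-" prefix assumed from the model-name-prefix convention
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

# Option 2: select an approach via extra_body instead of the model name.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    extra_body={"optillm_approach": "mcts"},  # field name assumed from the extra_body mechanism
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

print(response.choices[0].message.content)
```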
Quick Start & Requirements
- Install with pip install optillm, or use Docker (docker pull ghcr.io/codelion/optillm:latest).
- Set HF_TOKEN when serving local Hugging Face models. Other LLM providers are supported via environment variables (e.g., OPENAI_API_KEY, GEMINI_API_KEY).
- Run with python optillm.py, or docker run -p 8000:8000 ghcr.io/codelion/optillm:latest.
- In the OpenAI client, set base_url to http://localhost:8000/v1 (see the sketch after this list).
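A minimal end-to-end check, assuming the proxy is running locally on port 8000 and the upstream provider key (for example OPENAI_API_KEY) is already set in the proxy's environment:

```python
from openai import OpenAI

# The proxy speaks the OpenAI API, so the regular client works unchanged; only base_url changes.
# The api_key here is a placeholder; depending on configuration, the proxy may read the
# real provider key from its own environment instead.
client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model name the configured upstream provider accepts
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```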
Highlighted Details
Maintenance & Community
The project is actively developed, with contributions from Asankhaya Sharma. Community channels are not explicitly mentioned in the README.
Licensing & Compatibility
The project is available under an unspecified license. Its OpenAI API compatibility allows integration with existing tools and frameworks.
Limitations & Caveats
Some optimization techniques (e.g., cot_decoding, entropy_decoding) are not supported when using external servers such as the Anthropic API, llama.cpp, or Ollama, because those servers lack multi-response sampling.
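If you are unsure whether a given OpenAI-compatible endpoint can back these techniques, one rough check is whether it honors the standard n parameter for multiple completions, which sampling-based approaches rely on. This is a sketch against a hypothetical endpoint, not part of OptiLLM itself:

```python
from openai import OpenAI

# Hypothetical endpoint URL; substitute the server you want to test.
client = OpenAI(api_key="sk-...", base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Pick a random color."}],
    n=2,  # request two candidate completions in a single call
)

# Servers without multi-response sampling typically return a single choice or an error.
print(f"choices returned: {len(response.choices)}")
```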