optillm by algorithmicsuperintelligence

Optimizing inference proxy for LLMs

Created 1 year ago · 3,067 stars · Top 15.6% on SourcePulse

View on GitHub
Project Summary

OptiLLM is an OpenAI API-compatible inference proxy designed to enhance LLM performance and accuracy, particularly for coding, logical, and mathematical tasks. It targets developers and researchers seeking to improve LLM reasoning capabilities through advanced inference-time techniques.

How It Works

OptiLLM implements state-of-the-art techniques like Mixture of Agents (MoA), Monte Carlo Tree Search (MCTS), and Chain-of-Thought (CoT) decoding. These methods augment LLM responses by performing additional computations at inference time, aiming to surpass frontier models on complex queries. The proxy supports various optimization approaches, selectable via model name prefixes, extra_body parameters, or prompt tags.
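
As a concrete sketch, both selection styles might look like the following with the OpenAI Python client. The moa- model prefix and the optillm_approach field mirror the conventions described in the README; the model name, prompt, and dummy API key are placeholders, and the proxy is assumed to be listening on localhost:8000.

```python
from openai import OpenAI

# Point the standard OpenAI client at the optillm proxy
# (assumes the proxy is running locally on port 8000).
client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

# Option 1: select a technique with a model-name prefix (here, Mixture of Agents).
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # placeholder upstream model
    messages=[{"role": "user", "content": "Is 1,000,003 prime?"}],
)

# Option 2: keep the plain model name and pass the approach via extra_body.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Is 1,000,003 prime?"}],
    extra_body={"optillm_approach": "mcts"},
)
print(response.choices[0].message.content)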

Quick Start & Requirements

  • Install: pip install optillm or use Docker (docker pull ghcr.io/codelion/optillm:latest).
  • Prerequisites: Python 3.x. For local inference, HuggingFace models require HF_TOKEN. Supports various LLM providers via environment variables (e.g., OPENAI_API_KEY, GEMINI_API_KEY).
  • Usage: Run python optillm.py or docker run -p 8000:8000 ghcr.io/codelion/optillm:latest. Set base_url to http://localhost:8000/v1 in the OpenAI client (a smoke-test sketch follows this list).
  • Docs: https://github.com/codelion/optillm
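
Once the proxy is up, a quick smoke test against its OpenAI-compatible endpoint might look like the sketch below; the model name and prompt are placeholders, and the upstream provider is whichever one the environment variables configure.

```python
import requests

# Send a plain chat-completions request to the local proxy.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer dummy"},  # key is handled upstream via env vars
    json={
        "model": "gpt-4o-mini",  # placeholder upstream model
        "messages": [{"role": "user", "content": "Reply with the word 'ok'."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```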

Highlighted Details

  • Supports local inference with HuggingFace models and LoRAs.
  • Integrates with external tools via the Model Context Protocol (MCP) plugin.
  • Achieves state-of-the-art results on benchmarks like LongBench v2 and HELMET.
  • Offers plugins for memory, privacy, URL reading, code execution, and structured JSON output; plugin chaining is sketched after this list.
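
A minimal sketch of chaining plugins, assuming the readurls and memory plugin slugs and the & pipeline syntax described in the README; the underlying model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI(api_key="dummy", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    # readurls fetches pages linked in the prompt; memory adds long-context
    # recall; both run as a pipeline in front of the placeholder model.
    model="readurls&memory-gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Summarize the README at https://github.com/codelion/optillm",
    }],
)
print(response.choices[0].message.content)
```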

Maintenance & Community

The project is actively developed, maintained primarily by Asankhaya Sharma. Community channels are not explicitly mentioned in the README.

Licensing & Compatibility

The project is available under an unspecified license. Its OpenAI API compatibility allows integration with existing tools and frameworks.

Limitations & Caveats

Some optimization techniques (e.g., cot_decoding, entropy_decoding) are not supported when using external inference servers such as the Anthropic API, llama.cpp, or Ollama, because those servers cannot sample multiple candidate responses per request.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 9
  • Star history: 122 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler
  • Top 0.3% · 6k stars
  • LLM inference engine for blazing fast performance
  • Created 1 year ago · Updated 15 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Daniel Han (Cofounder of Unsloth), and 18 more.

gpt-oss by openai
  • Top 0.6% · 19k stars
  • Open-weight LLMs for reasoning and agents
  • Created 4 months ago · Updated 3 days ago