semantic-router by vllm-project

Intelligent LLM routing for efficient inference

Created 3 weeks ago

1,201 stars

Top 32.5% on SourcePulse

Project Summary

This project provides an intelligent Mixture-of-Models (MoM) router that improves LLM inference efficiency and accuracy by directing each request to the most suitable model in a pool, based on a semantic understanding of the request. It targets developers and researchers optimizing LLM deployments, offering improved accuracy, reduced latency, and added security through features such as PII detection and prompt guarding.

How It Works

The router employs BERT classification to semantically understand the intent, complexity, and task requirements of incoming requests. This allows it to intelligently select the most appropriate LLM from a configured set, akin to how Mixture-of-Experts (MoE) operates within a single model but applied at the model selection level. This approach leverages specialized models for specific tasks, leading to better overall inference accuracy and efficiency. It also supports automatic tool selection based on prompt relevance and includes similarity caching for prompt representations to reduce token usage and latency.
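
To make the routing idea concrete, the sketch below classifies a prompt into a category with a BERT-style classifier and maps that category to a backend model. The classifier, category labels, and model pool are illustrative stand-ins, not the project's actual configuration.

    # Illustrative sketch of Mixture-of-Models routing: classify the prompt's
    # category with a BERT-style classifier, then pick a backend model from a
    # pool. Classifier, labels, and pool are stand-ins, not the project's code.
    from transformers import pipeline

    # Stand-in for the router's fine-tuned BERT intent classifier.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    # Hypothetical mapping from request category to preferred backend model.
    MODEL_POOL = {
        "math": "qwen2.5-math-7b-instruct",
        "code generation": "deepseek-coder-6.7b-instruct",
        "general chat": "llama-3.1-8b-instruct",
    }

    def route(prompt: str) -> str:
        """Return the backend model best suited to the prompt's category."""
        result = classifier(prompt, candidate_labels=list(MODEL_POOL))
        best_category = result["labels"][0]  # labels come back sorted by score
        return MODEL_POOL[best_category]

    print(route("Integrate x**2 from 0 to 1"))  # expected to pick the math model

In the actual router this decision is made per request before forwarding, so each query reaches a model specialized for its task.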

Quick Start & Requirements

  • Installation: Instructions are available in the Complete Documentation; an illustrative client call is sketched after this list.
  • Prerequisites: The README lists Golang (with Rust FFI via Candle) and Python as implementation languages; specific version requirements are not detailed.
  • Resources: No specific hardware or resource estimates are provided.
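
Since the README defers setup details to the documentation, the following is only a hedged sketch of how a client might call the router, assuming it exposes an OpenAI-compatible endpoint. The URL, port, API key handling, and the "auto" model alias are assumptions, not documented values.

    # Hypothetical client call through the router. The base_url, api_key, and
    # "auto" model alias below are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8801/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="auto",  # let the router choose the backend model semantically
        messages=[{"role": "user", "content": "Explain consistent hashing briefly."}],
    )
    print(response.choices[0].message.content)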

Highlighted Details

  • Implements an Intelligent Mixture-of-Models (MoM) routing strategy.
  • Features PII detection and prompt guard capabilities for enhanced security.
  • Includes similarity caching to improve inference latency and reduce token usage (sketched after this list).
  • Offers both Golang and Python implementations, with benchmarking planned.
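
The similarity cache can be sketched roughly as follows: embed each prompt, and if a new prompt's embedding is close enough to a cached one, reuse the stored response instead of calling a backend model. The embedding model, threshold, and linear scan are illustrative choices, not the project's implementation.

    # Rough sketch of semantic similarity caching. Embedding model, similarity
    # threshold, and the linear scan are illustrative, not the project's code.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    class SimilarityCache:
        def __init__(self, threshold: float = 0.9):
            self.threshold = threshold
            self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)

        def get(self, prompt: str) -> str | None:
            query = encoder.encode(prompt, normalize_embeddings=True)
            for emb, response in self.entries:
                # Embeddings are unit-normalized, so the dot product is cosine similarity.
                if float(np.dot(query, emb)) >= self.threshold:
                    return response  # cache hit: skip the backend LLM call
            return None

        def put(self, prompt: str, response: str) -> None:
            emb = encoder.encode(prompt, normalize_embeddings=True)
            self.entries.append((emb, response))

    cache = SimilarityCache()
    cache.put("What is the capital of France?", "Paris.")
    print(cache.get("Tell me the capital city of France"))  # likely a cache hit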

Maintenance & Community

  • The project is associated with the vLLM project.
  • Community support is available via the #semantic-router channel in the vLLM Slack.
  • A citation is provided for academic use.

Licensing & Compatibility

  • The project is licensed under the Apache 2.0 license.
  • The Apache 2.0 license is generally permissive for commercial use and closed-source linking.

Limitations & Caveats

The README indicates that benchmarking between the Golang and Python implementations is planned, so comparative performance characteristics are not yet established. Specific version requirements for prerequisites and resource estimates for setup and operation are also not provided.

Health Check

  • Last Commit: 21 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 93
  • Issues (30d): 72

Star History

1,239 stars in the last 23 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Shyamal Anadkat (Research Scientist at OpenAI), and 12 more.

harmony by openai

Renderer for OpenAI's harmony response format

Top 0.5%, 4k stars, created 1 month ago, updated 1 month ago