smg by lightseekorg

High-performance LLM gateway for diverse inference backends

Created 7 months ago

385 stars

Top 73.9% on SourcePulse

2 Experts Love This Project

Eric Zhang

Founding Engineer at Modal

Yineng Zhang

Inference Lead at SGLang; Research Scientist at Together AI

Project Summary

Shepherd Model Gateway (SMG) is a high-performance, engine-agnostic LLM gateway built in Rust. It addresses the complexity of managing large-scale LLM deployments by centralizing worker lifecycle management and traffic balancing across diverse HTTP, gRPC, and OpenAI-compatible backends. SMG offers enterprise-ready control over history storage, privacy, and custom logic, which benefits teams that want efficient, unified, and observable LLM infrastructure.

How It Works

SMG is written in native Rust for speed, with a gRPC pipeline and sub-millisecond routing decisions. Its core differentiator is "cache-aware routing": routing takes the KV cache state of the inference engines (SGLang, vLLM, TensorRT-LLM) into account so that requests can reuse already-computed prefixes, which raises GPU utilization and reduces redundant work. SMG also exposes a single, unified API endpoint that routes requests to self-hosted models or to various cloud providers, which simplifies integration and hides backend diversity from clients.
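The idea behind cache-aware routing can be illustrated with a minimal Python sketch. This is not SMG's implementation (SMG is written in Rust and its internals are not described here); the `Worker` record, the `pick_worker` function, and the `min_match` threshold are all hypothetical names chosen for illustration. The sketch assumes the router remembers which token prefixes each worker has already served, prefers the worker with the longest shared prefix, and falls back to the least-loaded worker when no prefix matches well enough.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record; not SMG's actual data model."""
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)
    in_flight: int = 0


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, tokens, min_match=0.5):
    """Prefer the worker with the longest cached prefix, else the least loaded.

    min_match is an assumed tuning knob: the fraction of the request that
    must already be cached before cache affinity wins over load balancing.
    """
    def best_match(w):
        return max((shared_prefix_len(p, tokens) for p in w.cached_prefixes), default=0)

    best = max(workers, key=best_match)
    if tokens and best_match(best) / len(tokens) >= min_match:
        chosen = best
    else:
        chosen = min(workers, key=lambda w: w.in_flight)

    # Record that this worker will now hold the request's prefix in its cache.
    chosen.cached_prefixes.append(tuple(tokens))
    chosen.in_flight += 1
    return chosen
```

In this sketch, two requests that share most of their tokens land on the same worker, while an unrelated request goes to whichever worker has the fewest requests in flight. A production router would typically use a radix tree rather than a linear scan and would also evict stale prefixes.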

Quick Start & Requirements

Installation: Docker (docker pull lightseekorg/smg:latest), Python (pip install smg), or Rust (cargo install smg).
Prerequisites: Standard development environments for Docker, Python, or Rust. No specific hardware or software dependencies beyond the chosen installation method are detailed.
Links: Official documentation and guides are referenced implicitly within the README.

Highlighted Details

Performance: Built with native Rust, featuring a gRPC pipeline, sub-millisecond routing, zero-copy tokenization, circuit breakers, and automatic failover.
Routing Flexibility: Supports 8 routing policies, including cache_aware for KV cache optimization, prefix_hash, consistent_hashing, and round_robin.
Broad Backend Support: Integrates with self-hosted engines (vLLM, SGLang, TensorRT-LLM, Ollama, OpenAI-compatible) and cloud providers (OpenAI, Anthropic, Gemini, Bedrock, Azure OpenAI).
Enterprise Features: Offers multi-tenant rate limiting with OIDC, WASM plugins for custom logic, pluggable chat history storage (PostgreSQL, Oracle, Redis, in-memory), and high-availability mesh networking.
Observability: Provides 40+ Prometheus metrics, OpenTelemetry tracing, and structured JSON logs for detailed monitoring.
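Of the routing policies named above, consistent hashing is the easiest to show in isolation. The following Python sketch is a generic hash ring, not SMG's code: the `HashRing` class, the `vnodes` parameter, and the worker names are assumptions made for illustration. It shows the property that makes the policy useful for a gateway: the same key always reaches the same worker, and removing a worker only remaps the keys that worker owned.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring (illustrative; not SMG's implementation)."""

    def __init__(self, workers, vnodes=64):
        # Each worker is placed on the ring at `vnodes` pseudo-random points
        # so that load spreads evenly across workers.
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of SHA-256 as a stable 64-bit position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, key):
        """Return the worker owning the first ring point clockwise from the key."""
        if not self._ring:
            raise ValueError("no workers available")
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

A caller might route on a session or user identifier, for example `HashRing(["w1", "w2", "w3"]).route("session-42")`, so that follow-up requests from the same session reach the same backend.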

Maintenance & Community

The project welcomes contributions, with a reference to a "Contributing Guide." No specific community channels (e.g., Discord, Slack) or details on core maintainers, sponsorships, or roadmap are present in the provided text.

Licensing & Compatibility

The README does not specify the project's license or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

The provided README does not detail specific limitations, known bugs, alpha status, or unsupported platforms. The complexity of configuring and managing diverse LLM backends and enterprise features may present a practical adoption hurdle.

Health Check

Last Commit

14 hours ago

Responsiveness

Inactive

Pull Requests (30d)

195

Issues (30d)

Star History

61 stars in the last 30 days