ome by sgl-project

Kubernetes operator for LLM serving

Created 4 months ago
271 stars

Top 95.0% on SourcePulse

Project Summary

OME addresses the complex challenge of managing and serving Large Language Models (LLMs) within Kubernetes environments. It targets engineers and researchers who need robust, automated solutions for deploying and optimizing LLM inference at scale, offering benefits such as improved resource utilization and reduced operational overhead.

How It Works

OME operates as a Kubernetes operator, leveraging custom resources to define and manage LLMs as first-class citizens. It automates model parsing to extract critical metadata, intelligently selects optimal serving runtimes (like SGLang or Triton) based on model characteristics and weighted scoring, and orchestrates sophisticated deployment patterns. This approach optimizes GPU bin-packing and enables dynamic re-optimization for efficient resource utilization and high availability.
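
To make the custom-resource model concrete, below is a minimal sketch of declaring a model for serving. OME's actual CRD schema is not shown on this page, so the API group, version, kind, and field names are illustrative assumptions, not OME's documented API:

# Hypothetical manifest; group, kind, and all fields are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: ome.io/v1beta1            # assumed group/version
kind: InferenceService                # assumed kind
metadata:
  name: llama-3-8b-demo
  namespace: ome
spec:
  model:
    name: llama-3-8b                  # operator parses metadata from the weights
  runtime:
    autoSelect: true                  # let the operator score runtimes (e.g. SGLang)
EOF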

Quick Start & Requirements

  • Installation: Recommended via OCI Registry or Helm repository.
    • OCI:
        helm upgrade --install ome-crd oci://ghcr.io/moirai-internal/charts/ome-crd --namespace ome --create-namespace
        helm upgrade --install ome oci://ghcr.io/moirai-internal/charts/ome-resources --namespace ome
    • Helm: helm repo add ome https://sgl-project.github.io/ome, then helm repo update, then install the CRD and resource charts (a full sequence is sketched after this list).
  • Prerequisites: Kubernetes 1.28 or newer.
  • Documentation: OME concepts, Common tasks, Installation guide.
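
For the Helm-repository path, the complete sequence would look roughly like the following. The chart names ome-crd and ome-resources are assumptions inferred from the OCI charts above; verify them against the installation guide:

# Chart names are inferred from the OCI method above, not confirmed here.
helm repo add ome https://sgl-project.github.io/ome
helm repo update
helm upgrade --install ome-crd ome/ome-crd --namespace ome --create-namespace
helm upgrade --install ome ome/ome-resources --namespace ome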

Highlighted Details

  • Supports advanced deployment patterns including prefill-decode disaggregation and multi-node inference.
  • Integrates deeply with Kubernetes components like Kueue, LeaderWorkerSet, KEDA, and Gateway API.
  • Features automated benchmarking via the BenchmarkJob custom resource for performance evaluation (a sketch follows this list).
  • Provides first-class support for SGLang, an advanced inference engine.
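
As a sketch of the benchmarking feature: only the BenchmarkJob kind is named on this page, so the group, version, and every field below are hypothetical placeholders rather than OME's documented schema:

# Hypothetical BenchmarkJob; all fields are illustrative assumptions.
kubectl apply -f - <<'EOF'
apiVersion: ome.io/v1beta1                # assumed group/version
kind: BenchmarkJob
metadata:
  name: llama-3-8b-bench
  namespace: ome
spec:
  endpoint:
    inferenceService: llama-3-8b-demo     # target service (assumed field)
  task: text-to-text                      # workload shape (assumed field)
  maxTimePerIteration: 10m                # run bound (assumed field)
EOF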

Maintenance & Community

  • Active development with a roadmap prioritizing enhanced model parsing and quantization support.
  • Community support via GitHub Issues for bug reports and feature requests.

Licensing & Compatibility

  • Licensed under the MIT License, generally permissive for commercial use and integration.

Limitations & Caveats

  • The project is actively developed, with ongoing work to support additional model families and quantization workflows.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 27
  • Issues (30d): 5
  • Star History: 48 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

Top 0.3% on SourcePulse · 4k stars
AI inference pipeline framework
Created 1 year ago · Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

serve by pytorch

Top 0.1% on SourcePulse · 4k stars
Serve, optimize, and scale PyTorch models in production
Created 6 years ago · Updated 1 month ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

Top 0.3% on SourcePulse · 15k stars
Framework for LLM inference optimization experimentation
Created 1 year ago · Updated 2 days ago