kubeai by substratusai

Kubernetes operator for production ML model serving

Created 1 year ago
1,059 stars

Top 35.7% on SourcePulse

Project Summary

KubeAI is an AI Inference Operator for Kubernetes designed to simplify the deployment and scaling of machine learning models, particularly LLMs, embeddings, and speech-to-text models, in production environments. It targets Kubernetes users seeking an "it just works" solution for serving AI workloads, offering features like intelligent scaling, optimized routing, and model caching.

How It Works

KubeAI comprises a model proxy and a model operator. The proxy provides an OpenAI-compatible API and implements a novel prefix-aware load balancing strategy to optimize KV cache utilization for backend serving engines like vLLM, outperforming standard Kubernetes Services. The operator manages backend Pods, automating model downloads, volume mounting, and LoRA adapter orchestration via a custom resource definition (CRD). This architecture aims for simplicity by avoiding dependencies on external systems like Istio or Knative.
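As a sketch of the CRD-based workflow, a Model resource might look like the following. Field names follow the kubeai.org/v1 CRD described in the docs, but the model name, URL, and resource profile here are illustrative placeholders, not a verified manifest:

```yaml
# Illustrative Model resource (values are placeholders; check kubeai.org for the
# current schema and supported engines).
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Meta-Llama-3.1-8B-Instruct
  engine: VLLM                     # other engines include OLlama, FasterWhisper, Infinity
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0                   # enables scale-from-zero
  maxReplicas: 3
```

The operator watches these resources and creates the backend Pods, handling model download and volume mounting itself.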

Quick Start & Requirements

  • Install: helm install kubeai kubeai/kubeai --wait --timeout 10m
  • Prerequisites: a Kubernetes cluster (local clusters via kind or minikube are supported) and Helm. Podman users may need to increase the Podman machine's memory allocation.
  • Models: Deploy predefined models using a YAML configuration and Helm.
  • Docs: kubeai.org
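A hedged end-to-end sketch of the install flow. The repo URL and models chart name are assumptions based on the docs at kubeai.org; verify them before use:

```shell
# Add the KubeAI Helm repo (URL as documented at kubeai.org)
helm repo add kubeai https://www.kubeai.org
helm repo update

# Install the operator and proxy
helm install kubeai kubeai/kubeai --wait --timeout 10m

# Deploy predefined models from a YAML values file
# (chart name and values layout are assumptions; see the docs)
helm install kubeai-models kubeai/models -f kubeai-models.yaml
```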

Highlighted Details

  • Supports LLM inference (vLLM, Ollama), speech processing (FasterWhisper), and vector embeddings (Infinity).
  • Features intelligent scale-from-zero, optimized routing for improved TTFT and throughput, automated model caching, and dynamic LoRA adapter orchestration.
  • Offers OpenAI API compatibility for seamless integration with existing client libraries.
  • No dependencies on Istio, Knative, or the Prometheus metrics adapter.
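Because the proxy speaks the OpenAI API, existing clients only need to be pointed at a different base URL. A minimal sketch with curl, assuming the proxy has been port-forwarded locally (the service name, port, and model name below are assumptions about a particular install, not documented values):

```shell
# Forward the KubeAI proxy locally first, e.g.:
#   kubectl port-forward svc/kubeai 8000:80
# Then send a standard OpenAI-style chat completion request:
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

If the named model is scaled to zero, the first request triggers scale-up, so expect a cold-start delay before the response.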

Maintenance & Community

  • Known adopters include Telescope, Google Cloud Distributed Edge, Lambda, Vultr, Arcee, and Seeweb.
  • Community discussion available on Discord. Contact information for maintainers Nick Stogner and Sam Stoelinga is provided.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project's license is not clearly stated in the README, which may pose a risk for commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 20
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

  • 0.2%, 3k stars
  • Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
  • Created 1 year ago, updated 4 months ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

  • 1.0%, 5k stars
  • Inference framework for distributed generative AI model serving
  • Created 6 months ago, updated 13 hours ago