kubeai by kubeai-project

Kubernetes operator for production ML model serving

Created 2 years ago
1,179 stars

Top 32.7% on SourcePulse

View on GitHub
Project Summary

KubeAI is an AI Inference Operator for Kubernetes designed to simplify deploying and scaling machine learning models, particularly LLMs, embedding models, and speech-to-text models, in production environments. It targets Kubernetes users seeking an "it just works" solution for serving AI workloads, offering intelligent scaling, optimized routing, and model caching.

How It Works

KubeAI comprises a model proxy and a model operator. The proxy provides an OpenAI-compatible API and implements a novel prefix-aware load balancing strategy to optimize KV cache utilization for backend serving engines like vLLM, outperforming standard Kubernetes Services. The operator manages backend Pods, automating model downloads, volume mounting, and LoRA adapter orchestration via a custom resource definition (CRD). This architecture aims for simplicity by avoiding dependencies on external systems like Istio or Knative.
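For concreteness, here is a minimal sketch of declaring a model through the operator's CRD. The apiVersion, kind, and field names follow the project's documented schema, but the model URL, resource profile, and replica counts are illustrative assumptions; consult kubeai.org for the authoritative reference.

    kubectl apply -f - <<EOF
    apiVersion: kubeai.org/v1
    kind: Model
    metadata:
      name: llama-3.1-8b-instruct
    spec:
      features: [TextGeneration]
      url: hf://meta-llama/Llama-3.1-8B-Instruct   # operator downloads the weights
      engine: VLLM                                  # vLLM serving backend
      resourceProfile: nvidia-gpu-l4:1              # illustrative GPU profile
      minReplicas: 0                                # enables scale-from-zero
      maxReplicas: 3
    EOF

Once the manifest is applied, the operator creates and manages the backend Pods; no other components need to be installed.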

Quick Start & Requirements

  • Install: helm install kubeai kubeai/kubeai --wait --timeout 10m
  • Prerequisites: a Kubernetes cluster (local clusters via kind or minikube are supported) and Helm. Podman users may need to adjust machine memory.
  • Models: Deploy predefined models using a YAML configuration and Helm (see the sketch after this list).
  • Docs: kubeai.org
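A sketch of that flow, following the quickstart pattern; the chart repository URL and models chart match the project's quickstart, while the catalog entry name is an illustrative assumption:

    helm repo add kubeai https://www.kubeai.org
    helm repo update
    helm install kubeai kubeai/kubeai --wait --timeout 10m

    # Enable a predefined model from the catalog, then install the models chart.
    cat <<EOF > kubeai-models.yaml
    catalog:
      gemma2-2b-cpu:
        enabled: true
        minReplicas: 1
    EOF
    helm install kubeai-models kubeai/models -f ./kubeai-models.yaml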

Highlighted Details

  • Supports LLM inferencing (vLLM, Ollama), speech processing (FasterWhisper), and vector embeddings (Infinity).
  • Features intelligent scale-from-zero, optimized routing for improved TTFT and throughput, automated model caching, and dynamic LoRA adapter orchestration.
  • Offers OpenAI API compatibility for seamless integration with existing client libraries (see the example after this list).
  • Zero dependencies on Istio, Knative, or the Prometheus metrics adapter.
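As an example, a minimal smoke test of that compatibility, assuming the default kubeai Service, the /openai/v1 path prefix used in the docs, and the illustrative model name from the quickstart sketch above:

    # Forward the in-cluster proxy to localhost, then hit the OpenAI-style endpoint.
    kubectl port-forward svc/kubeai 8000:80 &
    curl http://localhost:8000/openai/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "gemma2-2b-cpu", "messages": [{"role": "user", "content": "Hello"}]}'

Existing OpenAI SDKs work the same way: point their base_url at this endpoint and leave the rest of the client code unchanged.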

Maintenance & Community

  • Known adopters include Telescope, Google Cloud Distributed Edge, Lambda, Vultr, Arcee, and Seeweb.
  • Community discussion is available on Discord; contact information for maintainers Nick Stogner and Sam Stoelinga is provided.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project's license is not clearly stated in the README, which may pose a risk for commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 7
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Matthew Johnson (Coauthor of JAX; Research Scientist at Google Brain), Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), and 3 more.

sglang-jax by sgl-project · 1.5% · 264 stars
High-performance LLM inference engine for JAX/TPU serving
Created 8 months ago · Updated 1 day ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Chaoyu Yang (Founder of Bento), and 3 more.

llm-d by llm-d · 1.7% · 3k stars
Kubernetes-native framework for distributed LLM inference
Created 11 months ago · Updated 1 day ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Elvis Saravia (Founder of DAIR.AI), and 2 more.

vllm-omni by vllm-project · 5.2% · 4k stars
Omni-modality model inference and serving framework
Created 7 months ago · Updated 1 day ago