kubeai by substratusai

Kubernetes operator for production ML model serving

Created 1 year ago
1,059 stars

Top 35.7% on SourcePulse

Project Summary

KubeAI is an AI Inference Operator for Kubernetes designed to simplify the deployment and scaling of machine learning models, particularly LLMs, embeddings, and speech-to-text models, in production environments. It targets Kubernetes users seeking an "it just works" solution for serving AI workloads, offering features like intelligent scaling, optimized routing, and model caching.

How It Works

KubeAI comprises a model proxy and a model operator. The proxy provides an OpenAI-compatible API and implements a novel prefix-aware load balancing strategy to optimize KV cache utilization for backend serving engines like vLLM, outperforming standard Kubernetes Services. The operator manages backend Pods, automating model downloads, volume mounting, and LoRA adapter orchestration via a custom resource definition (CRD). This architecture aims for simplicity by avoiding dependencies on external systems like Istio or Knative.
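As a sketch of the CRD-based workflow, a Model resource might look like the following. Field names follow the kubeai.org/v1 CRD described in the docs, but the model name, URL, and resource profile here are illustrative placeholders, not a verified manifest:

```yaml
# Illustrative Model resource (values are placeholders; check kubeai.org for the
# current schema and supported engines).
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Meta-Llama-3.1-8B-Instruct
  engine: VLLM                     # other engines include OLlama, FasterWhisper, Infinity
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 0                   # enables scale-from-zero
  maxReplicas: 3
```

The operator watches these resources and creates the backend Pods, handling model download and volume mounting itself.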

Quick Start & Requirements

  • Install: helm install kubeai kubeai/kubeai --wait --timeout 10m
  • Prerequisites: a Kubernetes cluster (local clusters via kind or minikube are supported) and Helm. Podman users may need to increase the Podman machine's memory allocation.
  • Models: Deploy predefined models using a YAML configuration and Helm.
  • Docs: kubeai.org
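A hedged end-to-end sketch of the install flow. The repo URL and models chart name are assumptions based on the docs at kubeai.org; verify them before use:

```shell
# Add the KubeAI Helm repo (URL as documented at kubeai.org)
helm repo add kubeai https://www.kubeai.org
helm repo update

# Install the operator and proxy
helm install kubeai kubeai/kubeai --wait --timeout 10m

# Deploy predefined models from a YAML values file
# (chart name and values layout are assumptions; see the docs)
helm install kubeai-models kubeai/models -f kubeai-models.yaml
```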

Highlighted Details

  • Supports LLM inference (vLLM, Ollama), speech processing (FasterWhisper), and vector embeddings (Infinity).
  • Features intelligent scale-from-zero, optimized routing for improved TTFT and throughput, automated model caching, and dynamic LoRA adapter orchestration.
  • Offers OpenAI API compatibility for seamless integration with existing client libraries.
  • No dependencies on Istio, Knative, or the Prometheus metrics adapter.
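Because the proxy speaks the OpenAI API, existing clients only need to be pointed at a different base URL. A minimal sketch with curl, assuming the proxy has been port-forwarded locally (the service name, port, and model name below are assumptions about a particular install, not documented values):

```shell
# Forward the KubeAI proxy locally first, e.g.:
#   kubectl port-forward svc/kubeai 8000:80
# Then send a standard OpenAI-style chat completion request:
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

If the named model is scaled to zero, the first request triggers scale-up, so expect a cold-start delay before the response.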

Maintenance & Community

  • Known adopters include Telescope, Google Cloud Distributed Edge, Lambda, Vultr, Arcee, and Seeweb.
  • Community discussion available on Discord. Contact information for maintainers Nick Stogner and Sam Stoelinga is provided.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project's license is not clearly stated in the README, which may pose a risk for commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 20
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

  • 0.2%, 3k stars
  • Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
  • Created 1 year ago, updated 4 months ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo

  • 1.0%, 5k stars
  • Inference framework for distributed generative AI model serving
  • Created 6 months ago, updated 13 hours ago