Advanced LLM inference platform for Kubernetes
llmaz is an alpha-stage inference platform for deploying large language models (LLMs) on Kubernetes, with production readiness as its stated goal. It targets engineers and power users who need an efficient, scalable, and flexible LLM serving solution, simplifying complex deployments by integrating state-of-the-art backends and offering robust cluster management.
How It Works
llmaz provides a unified, Kubernetes-native interface for LLM inference that abstracts the underlying complexity. It supports diverse inference backends (vLLM, TGI, SGLang, llama.cpp, TensorRT-LLM) and heterogeneous cluster configurations via the InftyAI Scheduler for cost-effective serving. Models load automatically from providers such as HuggingFace, Envoy AI Gateway integration handles traffic management, and HPA/Karpenter drive autoscaling, streamlining day-to-day operations.
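The model/serving split is easiest to see in a manifest. The sketch below follows the pattern in the project's examples: an OpenModel resource registers a HuggingFace-hosted model, and a Playground serves it. Since the APIs are alpha, the group versions, kinds, and field names shown here (llmaz.io/v1alpha1, inference.llmaz.io/v1alpha1, modelHub, modelClaim) may have changed; treat this as illustrative rather than authoritative.
```yaml
# Register a model that llmaz pulls automatically from HuggingFace
# (illustrative; verify against the current llmaz CRD reference).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
# Serve the model. The Playground uses a default backend (vLLM);
# other backends such as SGLang or llama.cpp are selected through
# the Playground's backend configuration.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
```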
Quick Start & Requirements
Installation follows standard Kubernetes deployment, as detailed in the Installation guide. Requirements are a Kubernetes cluster and kubectl; a HuggingFace token, supplied as a Kubernetes secret, may be needed for gated models. Example YAMLs for models and playgrounds are provided along with verification commands (see the sketch below); further tutorials live in the examples directory and develop.md.
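As a concrete sketch of those steps, assuming the two manifests above are saved as model.yaml; the secret name, namespace, key, and service name used here are placeholders to adapt from the Installation guide rather than confirmed defaults.
```bash
# Optional: token for gated or private HuggingFace models. Secret name,
# namespace, and key are illustrative; match them to the Installation guide.
kubectl create secret generic modelhub-secret \
  --namespace llmaz-system \
  --from-literal=HF_TOKEN=<your-huggingface-token>

# Deploy the example OpenModel and Playground manifests.
kubectl apply -f model.yaml

# Verify that the inference pods reach Running and a Service was created.
kubectl get pods
kubectl get svc

# Smoke-test the OpenAI-compatible endpoint (service name and port are
# deployment-specific; substitute your own).
kubectl port-forward svc/<playground-service> 8080:8080 &
curl http://localhost:8080/v1/models
```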
Maintenance & Community
Active community channels exist on Discord and Slack (#llmaz). Contributions are welcomed via CONTRIBUTING.md. The roadmap includes serverless support and disaggregated serving, and fundraising is handled via OpenCollective.
Licensing & Compatibility
The README does not specify a software license. Users should verify licensing terms for commercial use or closed-source integration.
Limitations & Caveats
llmaz is in alpha, so APIs may change. Multi-host homogeneous distributed inference is supported; heterogeneous distributed inference is planned.