llmaz by InftyAI

Advanced LLM inference platform for Kubernetes

Created 1 year ago
253 stars

Top 99.4% on SourcePulse

View on GitHub
Project Summary

llmaz is an alpha-stage inference platform for deploying large language models (LLMs) on Kubernetes. It targets engineers and power users who need an efficient, scalable, and flexible LLM-serving stack, simplifying complex deployments by integrating state-of-the-art backends and offering robust cluster management.

How It Works

llmaz provides a unified Kubernetes-native interface for LLM inference, abstracting underlying complexities. It supports diverse inference backends (vLLM, TGI, SGLang, llama.cpp, TensorRT-LLM) and heterogeneous cluster configurations via the InftyAI Scheduler for cost-effective serving. Automatic model loading from providers like HuggingFace, coupled with Envoy AI Gateway integration for traffic management and autoscaling via HPA/Karpenter, streamlines operations.
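To make the Kubernetes-native interface concrete, here is a minimal sketch of a model definition that is pulled automatically from HuggingFace. It follows the alpha API conventions seen in the project's examples (API group llmaz.io/v1alpha1, kind OpenModel); the model name and field layout are assumptions to verify against the repository.

```yaml
# Minimal sketch of a model definition under llmaz's alpha API.
# The model ID and family name are illustrative; check the repo's examples.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct   # fetched automatically from HuggingFace
```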

Quick Start & Requirements

Installation follows standard Kubernetes deployment, detailed in the Installation guide. Requirements include a Kubernetes cluster and kubectl; a HuggingFace token may need to be supplied via a Secret for gated models. Example YAMLs for models and playgrounds, along with verification commands, are provided. Further tutorials live in examples and develop.md.
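As a hedged illustration of the quick-start flow, the sketch below claims the model defined earlier with a Playground and then verifies the rollout. The Secret name, file path, and resource plurals are assumptions rather than confirmed project conventions.

```yaml
# Sketch of a Playground serving the model declared earlier (alpha API; may change).
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b   # must match the model resource's metadata.name
```

```sh
# Optional: supply a HuggingFace token for gated models.
# The Secret name and key here are assumptions; see the Installation guide.
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your-token>

# Apply and verify (the file name is a placeholder for the repo's example YAML).
kubectl apply -f playground.yaml
kubectl get playgrounds        # the Playground should report ready
kubectl get pods               # inference pods should reach Running
```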

Highlighted Details

  • Broad Backend Support: Integrates vLLM, TGI, SGLang, llama.cpp, TensorRT-LLM.
  • Heterogeneous Cluster Serving: Enables serving LLMs across diverse hardware via InftyAI Scheduler.
  • AI Gateway Integration: Uses Envoy for rate limiting and model routing.
  • Automated Scaling: Supports HPA driven by LLM metrics plus Karpenter node autoscaling (see the sketch after this list).
  • Integrated ChatUI: Includes Open WebUI for chatbot features (function call, RAG, web search).
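To ground the autoscaling bullet, below is a generic Kubernetes autoscaling/v2 HPA sketch. The target Deployment name and the queue-depth metric are hypothetical placeholders; the objects llmaz actually creates and the LLM metrics it exposes depend on the chosen backend and metrics adapter.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0--5b-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen2-0--5b                     # hypothetical workload backing the Playground
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting # hypothetical per-pod LLM metric via an adapter
        target:
          type: AverageValue
          averageValue: "5"               # scale out above 5 waiting requests per pod
```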

Maintenance & Community

Active community channels exist on Discord and Slack (#llmaz). Contributions are welcome via CONTRIBUTING.md. The roadmap includes serverless support and disaggregated serving. Fundraising is handled through OpenCollective.

Licensing & Compatibility

The README does not specify a software license. Users should verify licensing terms for commercial use or closed-source integration.

Limitations & Caveats

llmaz is in alpha, with potential API changes. Multi-host homogeneous distributed inference is supported; heterogeneous distributed inference is planned.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 12
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

torchchat by pytorch: PyTorch-native SDK for local LLM inference across diverse platforms. Top 0.1% on SourcePulse; 4k stars; created 1 year ago; updated 1 week ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.