llmaz by InftyAI

Advanced LLM inference platform for Kubernetes

Created 1 year ago
253 stars

Top 99.4% on SourcePulse

View on GitHub
Project Summary

llmaz is an alpha-stage inference platform for deploying large language models (LLMs) on Kubernetes. It targets engineers and power users who need an efficient, scalable, and flexible LLM-serving stack, simplifying complex deployments by integrating state-of-the-art backends and offering robust cluster management.

How It Works

llmaz provides a unified Kubernetes-native interface for LLM inference, abstracting underlying complexities. It supports diverse inference backends (vLLM, TGI, SGLang, llama.cpp, TensorRT-LLM) and heterogeneous cluster configurations via the InftyAI Scheduler for cost-effective serving. Automatic model loading from providers like HuggingFace, coupled with Envoy AI Gateway integration for traffic management and autoscaling via HPA/Karpenter, streamlines operations.
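To make the Kubernetes-native interface concrete, here is a minimal sketch of a model definition that is pulled automatically from HuggingFace. It follows the alpha API conventions seen in the project's examples (API group llmaz.io/v1alpha1, kind OpenModel); the model name and field layout are assumptions to verify against the repository.

```yaml
# Minimal sketch of a model definition under llmaz's alpha API.
# The model ID and family name are illustrative; check the repo's examples.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct   # fetched automatically from HuggingFace
```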

Quick Start & Requirements

Installation follows standard Kubernetes deployment, detailed in the Installation guide. Requirements include a Kubernetes cluster and kubectl; a HuggingFace token may need to be supplied via a Secret for gated models. Example YAMLs for models and playgrounds, along with verification commands, are provided. Further tutorials live in examples and develop.md.
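As a hedged illustration of the quick-start flow, the sketch below claims the model defined earlier with a Playground and then verifies the rollout. The Secret name, file path, and resource plurals are assumptions rather than confirmed project conventions.

```yaml
# Sketch of a Playground serving the model declared earlier (alpha API; may change).
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b   # must match the model resource's metadata.name
```

```sh
# Optional: supply a HuggingFace token for gated models.
# The Secret name and key here are assumptions; see the Installation guide.
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your-token>

# Apply and verify (the file name is a placeholder for the repo's example YAML).
kubectl apply -f playground.yaml
kubectl get playgrounds        # the Playground should report ready
kubectl get pods               # inference pods should reach Running
```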

Highlighted Details

  • Broad Backend Support: Integrates vLLM, TGI, SGLang, llama.cpp, TensorRT-LLM.
  • Heterogeneous Cluster Serving: Enables serving LLMs across diverse hardware via InftyAI Scheduler.
  • AI Gateway Integration: Uses Envoy for rate limiting and model routing.
  • Automated Scaling: Supports HPA driven by LLM metrics plus Karpenter node autoscaling (see the sketch after this list).
  • Integrated ChatUI: Includes Open WebUI for chatbot features (function call, RAG, web search).
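To ground the autoscaling bullet, below is a generic Kubernetes autoscaling/v2 HPA sketch. The target Deployment name and the queue-depth metric are hypothetical placeholders; the objects llmaz actually creates and the LLM metrics it exposes depend on the chosen backend and metrics adapter.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen2-0--5b-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen2-0--5b                     # hypothetical workload backing the Playground
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting # hypothetical per-pod LLM metric via an adapter
        target:
          type: AverageValue
          averageValue: "5"               # scale out above 5 waiting requests per pod
```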

Maintenance & Community

Active community channels exist on Discord and Slack (#llmaz). Contributions are welcome via CONTRIBUTING.md. The roadmap includes serverless support and disaggregated serving. Fundraising is handled through OpenCollective.

Licensing & Compatibility

The README does not specify a software license. Users should verify licensing terms for commercial use or closed-source integration.

Limitations & Caveats

llmaz is in alpha, with potential API changes. Multi-host homogeneous distributed inference is supported; heterogeneous distributed inference is planned.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 12
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

torchchat by pytorch: PyTorch-native SDK for local LLM inference across diverse platforms. Top 0.1% on SourcePulse; 4k stars; created 1 year ago; updated 1 week ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.