Kubernetes-native framework for distributed LLM inference
llm-d is a Kubernetes-native framework for high-performance distributed LLM inference, targeting users who need to serve large language models at scale with efficient resource utilization. It offers a modular solution built on vLLM, Kubernetes, and Inference Gateway (IGW), aiming for fast time-to-value and competitive performance per dollar.
How It Works
llm-d leverages vLLM's capabilities for disaggregated serving (separating prefill and decode) and KV cache management. Its core innovation is the vLLM-Optimized Inference Scheduler, which uses IGW's Endpoint Picker Protocol (EPP) for customizable, telemetry-driven load balancing. The scheduler considers KV-cache locality, Service Level Objectives (SLOs), and current load when making routing decisions, and allows advanced teams to plug in custom scoring algorithms, as sketched below.
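The following is a minimal, hypothetical sketch of what such telemetry-driven scoring could look like; it is not llm-d's actual scheduler API. The Endpoint fields, weights, and function names are assumptions made for illustration of how cache locality, load, and SLO headroom might be combined into a single routing score.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    """Telemetry snapshot for one vLLM replica (illustrative fields)."""
    name: str
    kv_cache_hit_ratio: float   # fraction of the prompt's KV blocks already cached here
    queue_depth: int            # requests currently waiting on this replica
    p95_latency_ms: float       # recent tail latency observed for this replica

def score(ep: Endpoint, slo_latency_ms: float = 500.0) -> float:
    """Combine cache locality, load, and SLO headroom into one score (higher is better).
    The weights are arbitrary and only illustrate the idea of a pluggable scorer."""
    cache_score = ep.kv_cache_hit_ratio                              # prefer prefix-cache reuse
    load_score = 1.0 / (1.0 + ep.queue_depth)                        # prefer lightly loaded replicas
    slo_score = max(0.0, 1.0 - ep.p95_latency_ms / slo_latency_ms)   # prefer replicas with SLO headroom
    return 0.5 * cache_score + 0.3 * load_score + 0.2 * slo_score

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Route the request to the highest-scoring replica."""
    return max(endpoints, key=score)

if __name__ == "__main__":
    replicas = [
        Endpoint("decode-0", kv_cache_hit_ratio=0.9, queue_depth=4, p95_latency_ms=420.0),
        Endpoint("decode-1", kv_cache_hit_ratio=0.1, queue_depth=0, p95_latency_ms=180.0),
    ]
    print(pick_endpoint(replicas).name)
```

In a real deployment this kind of scoring runs inside the Endpoint Picker rather than in client code; the point is only that routing weighs several telemetry signals instead of simple round-robin.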
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats