llm-d by llm-d

Kubernetes-native framework for distributed LLM inference

created 3 months ago
1,459 stars

Top 28.7% on sourcepulse

View on GitHub
Project Summary

llm-d is a Kubernetes-native framework for high-performance distributed LLM inference, targeting users who need to serve large language models at scale with efficient resource utilization. It offers a modular solution built on vLLM, Kubernetes, and Inference Gateway (IGW), aiming for fast time-to-value and competitive performance per dollar.

How It Works

llm-d builds on vLLM's support for disaggregated serving (separating prefill and decode) and KV cache management. Its core innovation is the vLLM-Optimized Inference Scheduler, which uses IGW's Endpoint Picker Protocol (EPP) for customizable, telemetry-driven load balancing. The scheduler weighs KV-cache locality, Service Level Objectives (SLOs), and per-replica load when routing requests, and advanced teams can plug in custom scoring algorithms.
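
As a rough illustration only (not llm-d's actual EPP plugin interface), the Go sketch below shows how a scheduler might fold per-replica telemetry into a single routing score; the type names, fields, and weights are hypothetical assumptions.

  // Hypothetical endpoint scoring; names, fields, and weights are
  // illustrative assumptions, not the llm-d or IGW plugin API.
  package main

  import "fmt"

  // Endpoint holds per-replica telemetry a scheduler might consider.
  type Endpoint struct {
      Name           string
      KVCacheHitRate float64 // fraction of the prompt prefix already cached (0..1)
      QueueDepth     int     // requests queued at this vLLM replica
      SLOHeadroom    float64 // remaining latency budget relative to the SLO (0..1)
  }

  // score combines the signals into one value; higher is better. The weights
  // are arbitrary and only demonstrate KV-cache- and SLO-aware routing.
  func score(e Endpoint) float64 {
      return 0.5*e.KVCacheHitRate + 0.3*e.SLOHeadroom - 0.2*float64(e.QueueDepth)/10.0
  }

  // pickEndpoint returns the highest-scoring replica for a request.
  func pickEndpoint(endpoints []Endpoint) Endpoint {
      best := endpoints[0]
      for _, e := range endpoints[1:] {
          if score(e) > score(best) {
              best = e
          }
      }
      return best
  }

  func main() {
      replicas := []Endpoint{
          {Name: "decode-0", KVCacheHitRate: 0.9, QueueDepth: 4, SLOHeadroom: 0.6},
          {Name: "decode-1", KVCacheHitRate: 0.1, QueueDepth: 1, SLOHeadroom: 0.8},
      }
      chosen := pickEndpoint(replicas)
      fmt.Printf("route to %s (score %.2f)\n", chosen.Name, score(chosen))
  }

In llm-d, comparable logic runs in the inference scheduler behind IGW's Endpoint Picker Protocol, where teams can register their own scorers.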

Quick Start & Requirements

  • Installation: Deploy as a full solution via a single Helm chart on Kubernetes. Individual components can be cloned for experimentation.
  • Prerequisites: Kubernetes cluster.
  • Resources: Deployment complexity and resource requirements depend on the scale of LLM serving.
  • Links: Quickstart, Project Overview

Highlighted Details

  • Built by contributors from Kubernetes and vLLM projects.
  • Supports disaggregated serving with vLLM, separating prefill and decode for independent optimization (see the sketch after this list).
  • Offers pluggable KV cache hierarchy management via vLLM's KVConnector.
  • Plans include variant autoscaling aware of hardware, workload, and traffic.
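
To make the prefill/decode split above more concrete, here is a deliberately simplified Go sketch of the handoff: a prefill worker builds the KV cache for a prompt, and a decode worker generates tokens from that cache. Every name and type here is an illustrative assumption, not a vLLM or llm-d API.

  // Conceptual sketch of disaggregated serving; all types are illustrative.
  package main

  import "fmt"

  // KVCache stands in for the attention key/value state produced by prefill.
  type KVCache struct {
      PromptTokens int
  }

  // prefill runs the compute-heavy prompt pass and returns the resulting cache,
  // which in a real deployment would be transferred to a decode replica.
  func prefill(prompt string) KVCache {
      return KVCache{PromptTokens: len(prompt)} // placeholder for real tokenization
  }

  // decode runs the memory-bound generation loop against the received cache.
  func decode(cache KVCache, maxNewTokens int) []string {
      out := make([]string, 0, maxNewTokens)
      for i := 0; i < maxNewTokens; i++ {
          out = append(out, fmt.Sprintf("tok%d", cache.PromptTokens+i))
      }
      return out
  }

  func main() {
      cache := prefill("Explain disaggregated serving.") // runs on a prefill replica
      tokens := decode(cache, 3)                          // runs on a decode replica
      fmt.Println(tokens)
  }

Separating the two phases lets prefill-heavy and decode-heavy replicas be sized and scaled independently, which is the motivation behind the disaggregated serving support listed above.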

Maintenance & Community

  • Community launched by CoreWeave, Google, IBM Research, NVIDIA, and Red Hat.
  • Active development with weekly standups and a Slack channel for discussions.
  • Links: Slack, Google Group

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

  • The variant autoscaling feature is still under development (marked 🚧).
  • Some advanced disaggregated serving and KV caching schemes are planned for future implementation.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 21
  • Issues (30d): 13
  • Star History: 1,485 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

1.1%
5k stars
Inference framework for distributed generative AI model serving
created 5 months ago
updated 21 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

1.0%
54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago
updated 18 hours ago