llm-d by llm-d

Kubernetes-native framework for distributed LLM inference

created 3 months ago
1,459 stars

Top 28.7% on sourcepulse

View on GitHub
Project Summary

llm-d is a Kubernetes-native framework for high-performance distributed LLM inference, targeting users who need to serve large language models at scale with efficient resource utilization. It offers a modular solution built on vLLM, Kubernetes, and Inference Gateway (IGW), aiming for fast time-to-value and competitive performance per dollar.

How It Works

llm-d builds on vLLM's support for disaggregated serving (separating prefill and decode) and KV cache management. Its core innovation is the vLLM-Optimized Inference Scheduler, which uses IGW's Endpoint Picker Protocol (EPP) for customizable, telemetry-driven load balancing. The scheduler weighs KV-cache locality, Service Level Objectives (SLOs), and per-replica load when routing requests, and advanced teams can plug in custom scoring algorithms.
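
As a rough illustration only (not llm-d's actual EPP plugin interface), the Go sketch below shows how a scheduler might fold per-replica telemetry into a single routing score; the type names, fields, and weights are hypothetical assumptions.

  // Hypothetical endpoint scoring; names, fields, and weights are
  // illustrative assumptions, not the llm-d or IGW plugin API.
  package main

  import "fmt"

  // Endpoint holds per-replica telemetry a scheduler might consider.
  type Endpoint struct {
      Name           string
      KVCacheHitRate float64 // fraction of the prompt prefix already cached (0..1)
      QueueDepth     int     // requests queued at this vLLM replica
      SLOHeadroom    float64 // remaining latency budget relative to the SLO (0..1)
  }

  // score combines the signals into one value; higher is better. The weights
  // are arbitrary and only demonstrate KV-cache- and SLO-aware routing.
  func score(e Endpoint) float64 {
      return 0.5*e.KVCacheHitRate + 0.3*e.SLOHeadroom - 0.2*float64(e.QueueDepth)/10.0
  }

  // pickEndpoint returns the highest-scoring replica for a request.
  func pickEndpoint(endpoints []Endpoint) Endpoint {
      best := endpoints[0]
      for _, e := range endpoints[1:] {
          if score(e) > score(best) {
              best = e
          }
      }
      return best
  }

  func main() {
      replicas := []Endpoint{
          {Name: "decode-0", KVCacheHitRate: 0.9, QueueDepth: 4, SLOHeadroom: 0.6},
          {Name: "decode-1", KVCacheHitRate: 0.1, QueueDepth: 1, SLOHeadroom: 0.8},
      }
      chosen := pickEndpoint(replicas)
      fmt.Printf("route to %s (score %.2f)\n", chosen.Name, score(chosen))
  }

In llm-d, comparable logic runs in the inference scheduler behind IGW's Endpoint Picker Protocol, where teams can register their own scorers.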

Quick Start & Requirements

  • Installation: Deploy as a full solution via a single Helm chart on Kubernetes. Individual components can be cloned for experimentation.
  • Prerequisites: Kubernetes cluster.
  • Resources: Deployment complexity and resource requirements depend on the scale of LLM serving.
  • Links: Quickstart, Project Overview

Highlighted Details

  • Built by contributors from Kubernetes and vLLM projects.
  • Supports disaggregated serving with vLLM, separating prefill and decode for independent optimization (see the sketch after this list).
  • Offers pluggable KV cache hierarchy management via vLLM's KVConnector.
  • Plans include variant autoscaling aware of hardware, workload, and traffic.
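
To make the prefill/decode split above more concrete, here is a deliberately simplified Go sketch of the handoff: a prefill worker builds the KV cache for a prompt, and a decode worker generates tokens from that cache. Every name and type here is an illustrative assumption, not a vLLM or llm-d API.

  // Conceptual sketch of disaggregated serving; all types are illustrative.
  package main

  import "fmt"

  // KVCache stands in for the attention key/value state produced by prefill.
  type KVCache struct {
      PromptTokens int
  }

  // prefill runs the compute-heavy prompt pass and returns the resulting cache,
  // which in a real deployment would be transferred to a decode replica.
  func prefill(prompt string) KVCache {
      return KVCache{PromptTokens: len(prompt)} // placeholder for real tokenization
  }

  // decode runs the memory-bound generation loop against the received cache.
  func decode(cache KVCache, maxNewTokens int) []string {
      out := make([]string, 0, maxNewTokens)
      for i := 0; i < maxNewTokens; i++ {
          out = append(out, fmt.Sprintf("tok%d", cache.PromptTokens+i))
      }
      return out
  }

  func main() {
      cache := prefill("Explain disaggregated serving.") // runs on a prefill replica
      tokens := decode(cache, 3)                          // runs on a decode replica
      fmt.Println(tokens)
  }

Separating the two phases lets prefill-heavy and decode-heavy replicas be sized and scaled independently, which is the motivation behind the disaggregated serving support listed above.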

Maintenance & Community

  • Community launched by CoreWeave, Google, IBM Research, NVIDIA, and Red Hat.
  • Active development with weekly standups and a Slack channel for discussions.
  • Links: Slack, Google Group

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

  • The variant autoscaling feature is still under development (marked 🚧).
  • Some advanced disaggregated serving and KV caching schemes are planned for future implementation.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 21
  • Issues (30d): 13
  • Star History: 1,485 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

dynamo by ai-dynamo

1.1%
5k stars
Inference framework for distributed generative AI model serving
created 5 months ago
updated 21 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

1.0%
54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago
updated 18 hours ago