lws by kubernetes-sigs

Kubernetes API for deploying pods as a unit of replication

Created 2 years ago

670 stars

Top 50.3% on SourcePulse

Project Summary

The LeaderWorkerSet (LWS) API provides a Kubernetes-native solution for deploying replicated groups of pods, specifically targeting AI/ML inference workloads like sharded LLMs across multiple nodes. It allows users to define a "super pod" composed of a leader and multiple workers, managed as a single unit for scaling, rolling updates, and topology-aware placement, simplifying complex distributed deployments.

How It Works

LWS introduces a custom resource that defines a group of pods, comprising one leader and a configurable number of workers. This group is treated as an atomic unit for lifecycle management. It supports dual pod templates (one for the leader, one for workers) and enables parallel creation of pods within a group. The API facilitates topology-aware placement, ensuring pods within a group can be co-located, and offers an "all-or-nothing" restart policy for group-level failure handling.
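As a rough illustration of the resource described above, the following Python sketch builds a LeaderWorkerSet manifest as a plain dictionary. The API group/version string, the field names (`leaderWorkerTemplate`, `size`, `restartPolicy`, `leaderTemplate`, `workerTemplate`) and the `RecreateGroupOnPodRestart` value are assumptions recalled from the upstream project and should be checked against the installed CRD. The resource name and image references are hypothetical.

```python
import json


def leader_worker_set(name, groups, group_size, leader_image, worker_image):
    """Build a LeaderWorkerSet manifest as a plain dict (illustrative only)."""

    def pod_template(container_name, image):
        # Minimal pod template with a single container.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        # Assumed API group/version; verify against the CRD in your cluster.
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            # Number of leader+worker groups, each managed as one unit.
            "replicas": groups,
            "leaderWorkerTemplate": {
                # Pods per group, counting the leader (assumed semantics).
                "size": group_size,
                # Assumed name of the "all-or-nothing" restart policy.
                "restartPolicy": "RecreateGroupOnPodRestart",
                # Dual templates: one for the leader, one for the workers.
                "leaderTemplate": pod_template("leader", leader_image),
                "workerTemplate": pod_template("worker", worker_image),
            },
        },
    }


# Hypothetical name and images, used only for illustration.
manifest = leader_worker_set(
    "llm-serving", 2, 4,
    "example.com/llm-leader:latest",
    "example.com/llm-worker:latest",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with standard Kubernetes tooling once the CRD is installed. Under the assumed semantics, this example describes two groups of four pods each.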

Quick Start & Requirements

Installation requires a running Kubernetes cluster, to which the LWS Custom Resource Definition (CRD) and its controller are applied.
Refer to the installation guide for detailed instructions.
Examples are available to demonstrate usage.

Highlighted Details

Manages groups of pods as a single unit for rolling updates and scaling.
Supports unique pod identity within a group via indexing.
Enables topology-aware placement for co-location of pods within a group.
Exposes a scale subresource for Horizontal Pod Autoscaler (HPA) integration.
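The indexing and autoscaling points above can be sketched in Python. The naming scheme below (leader named `<name>-<group>`, workers named `<name>-<group>-<worker>` with worker indices starting at 1) is an assumption about how LWS assigns identities, not something stated in this summary. The HPA manifest uses the standard `autoscaling/v2` shape, while the `scaleTargetRef` API version, the CPU target, and all names are hypothetical.

```python
def group_pod_names(lws_name, group_index, group_size):
    """Return the assumed pod names for one group, leader first."""
    leader = f"{lws_name}-{group_index}"
    # Assumption: the leader holds index 0, so workers are numbered from 1.
    workers = [f"{leader}-{i}" for i in range(1, group_size)]
    return [leader] + workers


def hpa_for(lws_name, min_groups, max_groups):
    """Build an autoscaling/v2 HPA dict that scales whole groups."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{lws_name}-hpa"},
        "spec": {
            # The HPA targets the LeaderWorkerSet through its scale subresource.
            "scaleTargetRef": {
                "apiVersion": "leaderworkerset.x-k8s.io/v1",  # assumed
                "kind": "LeaderWorkerSet",
                "name": lws_name,
            },
            "minReplicas": min_groups,
            "maxReplicas": max_groups,
            # Hypothetical metric choice, shown only to complete the manifest.
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
        },
    }


names = group_pod_names("llm-serving", 0, 4)
hpa = hpa_for("llm-serving", 1, 5)
print(names)
```

Because the autoscaler adjusts the group count rather than individual pods, each scale step in this sketch would add or remove an entire leader-plus-workers group.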

Maintenance & Community

Hosted under the kubernetes-sigs GitHub organization as a Kubernetes SIG project.
Community engagement is managed through standard Kubernetes channels (Slack, Mailing List).
Governed by the Kubernetes Code of Conduct.

Licensing & Compatibility

Licensed under the Apache License 2.0.
Compatible with commercial use and integration into closed-source applications.

Limitations & Caveats

LWS is a Kubernetes API extension, so it cannot be used outside a Kubernetes cluster. The README does not document performance characteristics or resource requirements for AI/ML workloads.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

17 stars in the last 30 days