lws by kubernetes-sigs

Kubernetes API for deploying pods as a unit of replication

created 1 year ago
526 stars

Top 60.9% on sourcepulse

Project Summary

The LeaderWorkerSet (LWS) API is a Kubernetes-native way to deploy replicated groups of pods, aimed at AI/ML inference workloads such as LLMs sharded across multiple nodes. Users define a "super pod" made up of one leader and multiple workers; the group is managed as a single unit for scaling, rolling updates, and topology-aware placement, which simplifies complex distributed deployments.

How It Works

LWS introduces a custom resource that defines a group of pods, comprising one leader and a configurable number of workers. This group is treated as an atomic unit for lifecycle management. It supports dual pod templates (one for the leader, one for workers) and enables parallel creation of pods within a group. The API facilitates topology-aware placement, ensuring pods within a group can be co-located, and offers an "all-or-nothing" restart policy for group-level failure handling.
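As a rough illustration of the resource described above, the following Python sketch builds a LeaderWorkerSet manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts. The field names (`leaderWorkerTemplate`, `size`, `leaderTemplate`, `workerTemplate`, `restartPolicy: RecreateGroupOnPodRestart`) and the API group/version are recalled from the upstream documentation and are assumptions here; verify them against the installed CRD. The resource name and container images are hypothetical placeholders.

```python
import json


def pod_template(container_name, image):
    """Return a minimal pod template with a single container."""
    return {"spec": {"containers": [{"name": container_name, "image": image}]}}


def leader_worker_set(name, replicas, size, leader_image, worker_image):
    """Build a LeaderWorkerSet manifest as a dict.

    Assumed semantics (check the LWS docs): `replicas` is the number of
    groups, and `size` is the number of pods per group.
    """
    return {
        # Assumed API group/version; confirm against the installed CRD.
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": size,
                # Assumed name of the "all-or-nothing" group restart policy.
                "restartPolicy": "RecreateGroupOnPodRestart",
                # Dual templates: one for the leader, one for the workers.
                "leaderTemplate": pod_template("leader", leader_image),
                "workerTemplate": pod_template("worker", worker_image),
            },
        },
    }


# Hypothetical name and images, for illustration only.
manifest = leader_worker_set(
    name="llm-shards",
    replicas=2,
    size=4,
    leader_image="example.com/llm-leader:latest",
    worker_image="example.com/llm-worker:latest",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would be one way to try it on a cluster where LWS is already installed.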

Quick Start & Requirements

  • Installation requires applying the LWS manifests, including the Custom Resource Definition (CRD), to a Kubernetes cluster.
  • Refer to the installation guide for detailed instructions.
  • Examples are available to demonstrate usage.

Highlighted Details

  • Manages groups of pods as a single unit for rolling updates and scaling.
  • Supports unique pod identity within a group via indexing.
  • Enables topology-aware placement for co-location of pods within a group.
  • Exposes a scale subresource for Horizontal Pod Autoscaler (HPA) integration.
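To make the indexing and group-level scaling concrete, here is a small Python model of pod identities. It is a sketch under stated assumptions, not the controller's actual logic: it assumes the group size counts the leader, that a leader is named `<name>-<group>`, and that workers are named `<name>-<group>-<worker>` with worker indices starting at 1. Check the LWS documentation for the real naming and labels.

```python
def group_pod_names(lws_name, group_index, size):
    """List pod names for one group: the leader first, then the workers.

    Assumed (hypothetical) scheme: leader "<name>-<group>", workers
    "<name>-<group>-<i>" for i = 1 .. size-1.
    """
    leader = f"{lws_name}-{group_index}"
    workers = [f"{leader}-{i}" for i in range(1, size)]
    return [leader] + workers


def all_pod_names(lws_name, replicas, size):
    """List every pod across all groups; scaling changes `replicas` only."""
    return [
        pod
        for group_index in range(replicas)
        for pod in group_pod_names(lws_name, group_index, size)
    ]


# Scaling from 2 to 3 groups adds one whole group (size pods), never a partial one.
before = all_pod_names("llm", replicas=2, size=4)
after = all_pod_names("llm", replicas=3, size=4)
print(len(before), len(after))
print(sorted(set(after) - set(before)))
```

Under this model an autoscaler acting on the scale subresource adjusts the number of groups, so the pod count always moves in multiples of the group size.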

Maintenance & Community

  • Hosted in the kubernetes-sigs GitHub organization, which houses Kubernetes SIG-sponsored projects.
  • Community engagement is managed through standard Kubernetes channels (Slack, Mailing List).
  • Governed by the Kubernetes Code of Conduct.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and integration into closed-source applications.

Limitations & Caveats

The project is presented as an API and requires a Kubernetes environment for deployment and operation. Specific performance characteristics or resource requirements for AI/ML workloads are not detailed in the README.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 26
  • Issues (30d): 14

Star History

  • 113 stars in the last 90 days
