trainer by kubeflow

Kubernetes-native project for distributed ML training and LLM fine-tuning

created 8 years ago
1,872 stars

Top 23.7% on sourcepulse

Project Summary

Kubeflow Trainer is a Kubernetes-native project for distributed ML model training and LLM fine-tuning, supporting frameworks like PyTorch, JAX, and TensorFlow, along with libraries such as HuggingFace and DeepSpeed. It targets ML engineers and researchers seeking to orchestrate complex training jobs on Kubernetes clusters, simplifying the development and deployment of scalable ML workloads.

How It Works

The project uses Kubernetes custom resource APIs to define and manage training jobs. It acts as an orchestrator: users integrate ML frameworks and libraries into Kubernetes-native Training Runtimes, and training jobs run against those runtimes. Distributed training configurations are therefore managed declaratively, which hides much of the underlying Kubernetes complexity.
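
To make the declarative idea concrete, here is a minimal sketch of what a training-job custom resource might look like when assembled in Python. It is an illustration under stated assumptions, not the project's documented schema: the API group and version (trainer.kubeflow.org/v1alpha1), the TrainJob kind, the field names (runtimeRef, trainer, numNodes), the runtime name "torch-distributed", and the container image are all assumed here and should be checked against the official Kubeflow documentation. The helper build_train_job is hypothetical and uses only the standard library.

```python
import json


def build_train_job(name, runtime, num_nodes, image, command):
    """Return a TrainJob-style manifest as a plain dict.

    Hypothetical helper: the field names below are assumptions about the
    alpha API and may not match the current schema.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        # Assumed API group/version and kind for the alpha API.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Points the job at a Training Runtime that supplies the
            # framework-specific setup (assumed field name).
            "runtimeRef": {"name": runtime},
            "trainer": {
                "numNodes": num_nodes,  # number of training nodes
                "image": image,  # placeholder container image
                "command": list(command),
            },
        },
    }


manifest = build_train_job(
    name="example-finetune",
    runtime="torch-distributed",  # assumed runtime name
    num_nodes=2,
    image="example.com/my-training-image:latest",  # placeholder
    command=["python", "train.py"],
)

# Serialize the manifest; kubectl accepts JSON as well as YAML.
print(json.dumps(manifest, indent=2))
```

The point of the sketch is the split it shows: the job states what to run and on how many nodes, while the referenced runtime carries the framework-specific details. The printed JSON could be saved to a file and applied with kubectl once the field names have been confirmed for the installed version.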

Quick Start & Requirements

  • Installation and getting-started details are available in the official Kubeflow documentation.
  • Requires a Kubernetes cluster.

Highlighted Details

  • Supports distributed training for PyTorch, JAX, TensorFlow, HuggingFace, DeepSpeed, and Megatron-LM.
  • Enables LLM fine-tuning and large-scale ML model training.
  • Kubernetes-native design for orchestration and management.
  • Can be integrated with the Kubeflow Python SDK.

Maintenance & Community

  • Active community engagement via a dedicated Slack channel (#kubeflow-trainer) and bi-weekly AutoML and Training Working Group meetings.
  • Contributions are welcomed via the CONTRIBUTING guide.
  • Changelog is available.

Licensing & Compatibility

  • The project's specific license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

Kubeflow Trainer is currently in alpha, so its APIs are subject to change. Kubeflow Training Operator V1 is still maintained; users migrating from it should consult the provided migration document.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 48
  • Issues (30d): 29
  • Star history: 109 stars in the last 90 days
