kubedl  by kubedl-io

Easier and efficient deep learning on Kubernetes

Created 5 years ago
531 stars

Top 59.7% on SourcePulse

Project Summary

KubeDL simplifies and optimizes deep learning workloads on Kubernetes. Targeting ML engineers and platform operators, it provides a unified controller for training and inference, enhancing efficiency through advanced scheduling and resource management. As a CNCF sandbox project, it aims to streamline the deployment and operation of ML models in cloud-native environments.

How It Works

KubeDL employs a unified controller to manage diverse deep learning workloads, including training and inference for frameworks like TensorFlow, PyTorch, and Mars. Its architecture incorporates advanced scheduling, cache-based acceleration, metadata persistence, and file synchronization to boost performance and simplify operations. The system also integrates with Morphling for automatic configuration tuning of ML model deployments, and it supports packaging models as container images with lineage tracked through Kubernetes CRDs.
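To make the "unified controller" idea more concrete, here is a minimal sketch of how training jobs for two frameworks could be described as Kubernetes custom resources that share one shape. The API group `training.kubedl.io/v1alpha1`, the kinds `TFJob` and `PyTorchJob`, the replica-spec field names, the container name, and the image URLs are all assumptions made for illustration; they are not confirmed by this summary and should be checked against the KubeDL documentation before use.

```python
import json


def build_training_job(name, kind, replica_field, image, workers=1):
    """Return a custom-resource manifest as a plain dict.

    All field names below are illustrative assumptions, not a verified
    KubeDL schema.
    """
    return {
        "apiVersion": "training.kubedl.io/v1alpha1",  # assumed API group/version
        "kind": kind,                                  # assumed kind name
        "metadata": {"name": name},
        "spec": {
            replica_field: {                           # assumed field name
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "Never",
                    "template": {
                        "spec": {
                            "containers": [
                                # hypothetical container name and image
                                {"name": "trainer", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


# Two frameworks, one manifest shape: only the kind and replica field differ.
tf_job = build_training_job(
    "mnist-tf", "TFJob", "tfReplicaSpecs",
    "example.com/tf-mnist:latest", workers=2,
)
torch_job = build_training_job(
    "mnist-torch", "PyTorchJob", "pytorchReplicaSpecs",
    "example.com/torch-mnist:latest",
)

if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output could be saved to a file
    # and applied once the schema has been verified.
    print(json.dumps(tf_job, indent=2))
```

The point of the sketch is the design choice rather than the exact schema: because each framework is just another resource kind handled by the same controller, a platform team can generate and validate job manifests with one code path.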

Quick Start & Requirements

  • Requires a Kubernetes cluster. The required Kubernetes version and any prerequisites for the supported ML frameworks are not specified.
  • Further details may be available on the official website: https://kubedl.io.

Highlighted Details

  • Hosted as a sandbox project in the Cloud Native Computing Foundation (CNCF) ecosystem.
  • Supports multiple ML frameworks including TensorFlow, PyTorch, and Mars within a single controller.
  • Features include advanced scheduling, cache acceleration, metadata persistence, and file sync.
  • Integrates Morphling for automatic configuration tuning of ML model deployments, and tracks model lineage through Kubernetes CRDs.
  • Related research published in "Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving" (ACM SoCC 2021).

Maintenance & Community

  • Community engagement channels include DingTalk (discussions/usage), GitHub Issues (bugs/features), and a dedicated email list (cncf-kubedl-maintainers@lists.cncf.io) for specific topics.
  • Estimated response times range from less than a day to three days depending on the channel.

Licensing & Compatibility

  • The specific open-source license is not mentioned in the provided text. Compatibility for commercial use or integration with closed-source systems requires license clarification.

Limitations & Caveats

  • As a CNCF sandbox project, KubeDL may be in an early stage of development, potentially lacking mature features or stability guarantees.
  • The README does not specify installation instructions, detailed system requirements, or known limitations/bugs.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
