AI/ML examples, best practices, and solutions for Google Kubernetes Engine
This repository provides a collection of assets, best practices, and pre-built solutions for deploying and scaling AI/ML workloads on Google Kubernetes Engine (GKE). It targets engineers and researchers building robust AI platforms, offering infrastructure orchestration for GPUs and TPUs, distributed computing integrations, and resource sharing across multiple teams.
How It Works
The project leverages Terraform for infrastructure provisioning, enabling the deployment of GKE clusters (Standard and Autopilot) with support for GPUs and TPUs. It includes modules for common AI/ML components like JupyterHub for interactive development and Kuberay for distributed training and serving. The architecture focuses on providing a flexible and scalable foundation for diverse AI workloads.
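As an illustration of the Terraform-based provisioning described above, the sketch below shows a GKE cluster with a GPU node pool using the official google provider. The resource names, machine type, accelerator type, and region are placeholder assumptions for illustration, not the repository's actual module interface or defaults.

```hcl
# Hypothetical sketch: a GKE cluster plus a GPU node pool.
# Names, region, and accelerator choices are assumptions.
resource "google_container_cluster" "ai_cluster" {
  name     = "ai-on-gke-demo"   # placeholder name
  location = "us-central1"

  # Manage node pools separately rather than via the default pool.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.ai_cluster.name
  location = google_container_cluster.ai_cluster.location

  node_config {
    machine_type = "g2-standard-8"   # L4-attached machine family
    guest_accelerator {
      type  = "nvidia-l4"
      count = 1
    }
  }
}
```

The repository's own modules wrap details like this (and TPU node pools, Autopilot mode, and workload add-ons such as JupyterHub and Kuberay) behind variables in platform.tfvars.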
Quick Start & Requirements
Deploy by running terraform init and then terraform apply -var-file platform.tfvars, after configuring a GCS bucket for Terraform state persistence and updating platform.tfvars.
Highlighted Details
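The GCS state configuration mentioned above can be sketched as a standard Terraform backend block; the bucket name and prefix here are placeholders, not values from the repository:

```hcl
# Hypothetical backend configuration; the bucket must already exist.
terraform {
  backend "gcs" {
    bucket = "my-tf-state-bucket"   # placeholder: your pre-created GCS bucket
    prefix = "ai-on-gke/platform"   # placeholder: path for this deployment's state
  }
}
```

With this in place, terraform init connects to the bucket and terraform apply -var-file platform.tfvars provisions the platform.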
Maintenance & Community
This is an official Google Cloud Platform repository. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The use of assets is subject to Google's AI Principles. The repository includes a LICENSE file, but its specific type and restrictions are not detailed in the README. Compatibility for commercial use or closed-source linking would require reviewing the LICENSE file.
Limitations & Caveats
The repository assumes a pre-existing GKE cluster for application deployment, though infrastructure modules are provided for cluster creation.