ai-on-gke by GoogleCloudPlatform

AI/ML examples, best practices, and solutions for Google Kubernetes Engine

Created 2 years ago

327 stars

Top 83.6% on SourcePulse

Project Summary

This repository provides a collection of assets, best practices, and pre-built solutions for deploying and scaling AI/ML workloads on Google Kubernetes Engine (GKE). It targets engineers and researchers looking to build robust AI platforms, offering infrastructure orchestration for GPUs/TPUs, distributed computing integration, and multi-team resource utilization.

How It Works

The project uses Terraform to provision infrastructure, deploying GKE clusters (Standard and Autopilot) with GPU and TPU support. It includes modules for common AI/ML components such as JupyterHub for interactive development and KubeRay for distributed training and serving. The architecture is intended as a flexible, scalable foundation for diverse AI workloads.

Quick Start & Requirements

Install/Run: Configure a GCS bucket for state persistence and update platform.tfvars, then run terraform init followed by terraform apply -var-file platform.tfvars.
Prerequisites: Google Cloud SDK and Terraform. The application modules assume a working GKE cluster; the infrastructure modules can create one.
Resources: Deployment involves provisioning GKE resources, potentially including GPUs.
Docs: infrastructure/README.md, applications/ray/README.md, tutorial.md.
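
The Install/Run step above can be sketched as a small shell script. This is a minimal sketch, not the repository's own tooling: the bucket name is a placeholder, the run helper and the STATE_BUCKET, VAR_FILE, and DRY_RUN variables are hypothetical, and passing the bucket through -backend-config assumes the Terraform configuration declares a gcs backend (the repository may instead expect the bucket to be set in a file). By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the Quick Start flow. Prints the commands by default;
# set DRY_RUN=0 to actually execute them.
set -euo pipefail

# Assumption: placeholder bucket name; replace with your own GCS bucket.
STATE_BUCKET="${STATE_BUCKET:-my-terraform-state-bucket}"
VAR_FILE="${VAR_FILE:-platform.tfvars}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: the configuration declares a "gcs" backend, so the state
# bucket can be supplied at init time.
run terraform init -backend-config="bucket=${STATE_BUCKET}"

# Apply using the edited variables file.
run terraform apply -var-file "${VAR_FILE}"
```

Run it from the directory that contains platform.tfvars. Keeping dry-run as the default makes it easy to review the exact commands before anything is provisioned.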

Highlighted Details

Provides Terraform modules for deploying JupyterHub and KubeRay clusters on GKE.
Includes examples for AI/ML use cases such as online serving, batch processing, and distributed training with tools such as Ray and JobSet.
Offers solutions for GPU utilization monitoring (dcgm-on-gke) and custom GKE disk image building.
Features end-to-end tutorials for specific models like Llama 2 and Llama 7B on L4 GPUs.

Maintenance & Community

This is an official Google Cloud Platform repository. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The use of assets is subject to Google's AI Principles. The repository includes a LICENSE file, but its specific type and restrictions are not detailed in the README. Compatibility for commercial use or closed-source linking would require reviewing the LICENSE file.

Limitations & Caveats

The application modules assume a pre-existing GKE cluster, although infrastructure modules are provided to create one. The README does not spell out the licensing terms or any commercial-use restrictions.

Health Check

Last Commit

6 months ago

Responsiveness

1+ week

Star History

2 stars in the last 30 days