ai-on-gke  by GoogleCloudPlatform

AI/ML examples, best practices, and solutions for Google Kubernetes Engine

Created 2 years ago
324 stars

Top 83.8% on SourcePulse

Project Summary

This repository provides a collection of assets, best practices, and pre-built solutions for deploying and scaling AI/ML workloads on Google Kubernetes Engine (GKE). It targets engineers and researchers looking to build robust AI platforms, offering infrastructure orchestration for GPUs/TPUs, distributed computing integration, and multi-team resource utilization.

How It Works

The project uses Terraform for infrastructure provisioning, enabling the deployment of GKE clusters (Standard and Autopilot) with support for GPUs and TPUs. It includes modules for common AI/ML components such as JupyterHub for interactive development and KubeRay for distributed training and serving. The architecture aims to provide a flexible and scalable foundation for diverse AI workloads.

Quick Start & Requirements

  • Install/Run: Configure a GCS bucket for state persistence and update platform.tfvars, then run terraform init followed by terraform apply -var-file platform.tfvars.
  • Prerequisites: Google Cloud SDK and Terraform. The application modules assume a functional GKE cluster already exists.
  • Resources: Deployment involves provisioning GKE resources, potentially including GPUs.
  • Docs: infrastructure/README.md, applications/ray/README.md, tutorial.md.
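
The Install/Run step above can be sketched as a small Python wrapper that assembles the two Terraform invocations in order. This is an illustrative sketch, not part of the repository: the helper names build_terraform_commands and run are hypothetical, and passing the state bucket through -backend-config assumes the module's backend block accepts partial configuration, which the README summary does not confirm. The default is a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List, Optional


def build_terraform_commands(var_file: str = "platform.tfvars",
                             state_bucket: Optional[str] = None) -> List[List[str]]:
    """Return the Terraform invocations from the Quick Start, in order."""
    init = ["terraform", "init"]
    if state_bucket is not None:
        # Assumption: the backend is set up for partial configuration, so the
        # GCS bucket name can be supplied at init time. Otherwise, edit the
        # backend settings in the module directly and leave this unset.
        init.append(f"-backend-config=bucket={state_bucket}")
    apply = ["terraform", "apply", "-var-file", var_file]
    return [init, apply]


def run(commands: List[List[str]], workdir: str = ".", dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing Terraform command.
            subprocess.run(cmd, cwd=workdir, check=True)


if __name__ == "__main__":
    # Dry run: shows what would be executed without touching any cloud resources.
    run(build_terraform_commands(), dry_run=True)
```

Calling run(..., dry_run=False) from the directory that contains platform.tfvars would execute the commands for real; terraform apply still asks for interactive confirmation before provisioning anything.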

Highlighted Details

  • Provides Terraform modules for deploying JupyterHub and KubeRay clusters on GKE.
  • Includes examples for various AI/ML use cases such as online serving, batch processing, and distributed training with frameworks like Ray and JobSet.
  • Offers solutions for GPU utilization monitoring (dcgm-on-gke) and custom GKE disk image building.
  • Features end-to-end tutorials for specific models, such as Llama 2 and a 7B-parameter Llama model, on L4 GPUs.

Maintenance & Community

This is an official Google Cloud Platform repository. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The use of assets is subject to Google's AI Principles. The repository includes a LICENSE file, but its specific type and restrictions are not detailed in the README. Compatibility for commercial use or closed-source linking would require reviewing the LICENSE file.

Limitations & Caveats

Application deployment assumes a pre-existing GKE cluster, although infrastructure modules are provided for creating one. Licensing terms and any commercial-use restrictions are not detailed in the README.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
