k8s-vgpu-scheduler by 4paradigm

Kubernetes scheduler for virtualized GPUs

Created 4 years ago

582 stars

Top 55.7% on SourcePulse

Project Summary

This project provides a Kubernetes device plugin and scheduler for virtualizing GPU resources, enabling fine-grained allocation of GPU memory and compute units. It targets AI/ML workloads and cloud platforms needing to maximize GPU utilization by sharing resources among multiple tasks or allowing oversubscription of GPU memory.

How It Works

The solution extends the NVIDIA device plugin for Kubernetes, allowing users to request fractional GPUs, specify memory limits (e.g., 3000MB or 50% of total), and even oversubscribe GPU memory by using host RAM as swap. It also supports specifying desired GPU types or avoiding certain types via annotations. The scheduler component balances GPU usage across nodes, aiming for improved resource utilization.
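
The two ways of expressing a memory limit described above (an absolute size such as 3000MB, or a percentage of the device total) can be illustrated with a small sketch. This is not the project's code: `resolve_memory_request` and its unit handling are hypothetical, written only to show how both forms reduce to a single number of megabytes.

```python
import re

# Assumption for this sketch: sizes are whole numbers of MB or GB,
# and 1 GB is treated as 1024 MB.
MB_PER_UNIT = {"MB": 1, "GB": 1024}


def resolve_memory_request(request: str, total_mb: int) -> int:
    """Turn a request such as '3000MB' or '50%' into a size in MB.

    `total_mb` is the total memory of the physical GPU. Percentages are
    resolved against it; absolute sizes are returned unchanged.
    """
    text = request.strip().upper()

    percent = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%", text)
    if percent:
        value = float(percent.group(1))
        if not 0 < value <= 100:
            raise ValueError(f"percentage out of range: {request!r}")
        return int(total_mb * value / 100)

    absolute = re.fullmatch(r"(\d+)\s*(MB|GB)", text)
    if absolute:
        return int(absolute.group(1)) * MB_PER_UNIT[absolute.group(2)]

    raise ValueError(f"unrecognized memory request: {request!r}")


if __name__ == "__main__":
    # A 16 GB card, expressed in MB.
    print(resolve_memory_request("3000MB", 16384))
    print(resolve_memory_request("50%", 16384))
```

The sketch deliberately does not cap absolute sizes at the device total, since the project allows oversubscribing GPU memory with host RAM.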

Quick Start & Requirements

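Installation: Uses Helm. Add the chart repository, then install the chart into the kube-system namespace, replacing <your-k8s-version> with the image tag for your Kubernetes version:
    helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler
    helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=<your-k8s-version> -n kube-system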
Prerequisites: NVIDIA drivers >= 384.81, nvidia-docker > 2.0, Kubernetes >= 1.16, glibc >= 2.17, kernel >= 3.10, helm > 3.0. The NVIDIA Container Toolkit must be configured as the default container runtime, and GPU nodes must be labeled gpu=on.
Resources: Official documentation and examples are available.

Highlighted Details

Enables GPU sharing, allowing multiple tasks to utilize portions of a single GPU.
Supports GPU memory allocation by absolute size or percentage.
Features "Virtual Device Memory" to oversubscribe GPU memory using host RAM.
Allows specifying GPU types to use or avoid via annotations.
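
As a rough illustration of how these features might appear in a Pod definition, the sketch below builds a manifest as a Python dictionary. The resource names (`nvidia.com/gpu`, `nvidia.com/gpumem`) and annotation keys (`nvidia.com/use-gputype`, `nvidia.com/nouse-gputype`) are assumptions that should be checked against the project's documentation, and `build_vgpu_pod` is a hypothetical helper, not part of the project.

```python
import json
from typing import Any, Dict, Sequence


def build_vgpu_pod(
    name: str,
    image: str,
    vgpus: int,
    gpumem_mb: int,
    use_types: Sequence[str] = (),
    avoid_types: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build a Pod manifest that requests vGPUs with a per-vGPU memory limit."""
    annotations: Dict[str, str] = {}
    if use_types:
        # Assumed annotation key: GPU types the task is allowed to run on.
        annotations["nvidia.com/use-gputype"] = ",".join(use_types)
    if avoid_types:
        # Assumed annotation key: GPU types the task must not run on.
        annotations["nvidia.com/nouse-gputype"] = ",".join(avoid_types)

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        "limits": {
                            # Assumed resource names; verify before use.
                            "nvidia.com/gpu": vgpus,
                            "nvidia.com/gpumem": gpumem_mb,
                        }
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    pod = build_vgpu_pod(
        "gpu-task", "ubuntu:22.04", vgpus=1, gpumem_mb=3000,
        avoid_types=["NVIDIA A10"],
    )
    # JSON is valid YAML, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(pod, indent=2))
```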

Maintenance & Community

The project has been renamed to project-HAMi, but the old repository is maintained for compatibility. Contact information for the owner/maintainer is provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, A100 MIG supports only "none" and "mixed" modes. Tasks specifying nodeName are not supported; nodeSelector should be used instead. Only computing tasks are supported; video codec processing is not yet implemented.
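
The nodeName restriction can be caught before a task is submitted. The following is a minimal sketch of such a pre-submission check, assuming the Pod manifest is available as a Python dictionary; `find_unsupported_settings` is a hypothetical helper and not something the project ships.

```python
from typing import Any, Dict, List


def find_unsupported_settings(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for settings the scheduler does not support."""
    problems: List[str] = []
    spec = pod.get("spec", {})

    # Tasks that pin themselves to a node with nodeName are not supported;
    # nodeSelector is the supported way to constrain placement.
    if spec.get("nodeName"):
        problems.append(
            "spec.nodeName is set to %r; use spec.nodeSelector instead"
            % spec["nodeName"]
        )
    return problems


if __name__ == "__main__":
    pinned = {"spec": {"nodeName": "gpu-node-1", "containers": []}}
    for problem in find_unsupported_settings(pinned):
        print(problem)
```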

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

1 star in the last 30 days