k8s-vgpu-scheduler by 4paradigm

Kubernetes scheduler for virtualized GPUs

created 4 years ago
570 stars

Top 57.4% on sourcepulse

Project Summary

This project provides a Kubernetes device plugin and scheduler for virtualizing GPU resources, enabling fine-grained allocation of GPU memory and compute units. It targets AI/ML workloads and cloud platforms needing to maximize GPU utilization by sharing resources among multiple tasks or allowing oversubscription of GPU memory.

How It Works

The solution extends the NVIDIA device plugin for Kubernetes, allowing users to request fractional GPUs, specify memory limits (e.g., 3000MB or 50% of total), and even oversubscribe GPU memory by using host RAM as swap. It also supports specifying desired GPU types or avoiding certain types via annotations. The scheduler component balances GPU usage across nodes, aiming for improved resource utilization.
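As a minimal sketch of what such a request might look like, the Python below builds a Pod manifest that asks for one virtual GPU with a 3000 MB memory limit. The resource names `nvidia.com/gpu`, `nvidia.com/gpumem`, and `nvidia.com/gpucores` are assumptions based on the project's README and should be checked against the installed version; the pod name, image, and the helper `vgpu_pod` are hypothetical.

```python
import json


def vgpu_pod(name, image, gpus=1, gpumem_mb=None, gpucores=None):
    """Build a Pod manifest (as a dict) that requests virtual GPU resources.

    The resource keys below are assumed from the project's README,
    not verified against a running cluster.
    """
    limits = {"nvidia.com/gpu": gpus}  # number of virtual GPUs
    if gpumem_mb is not None:
        limits["nvidia.com/gpumem"] = gpumem_mb  # device memory per vGPU, in MB (assumed unit)
    if gpucores is not None:
        limits["nvidia.com/gpucores"] = gpucores  # share of compute units (assumed key)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": name, "image": image, "resources": {"limits": limits}}
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = vgpu_pod("gpu-demo", "ubuntu:22.04", gpus=1, gpumem_mb=3000)
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` on a cluster where the scheduler is installed.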

Quick Start & Requirements

  • Installation: Uses Helm. Add repo: helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler. Install: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=<your-k8s-version> -n kube-system.
  • Prerequisites: NVIDIA drivers >= 384.81, nvidia-docker > 2.0, Kubernetes >= 1.16, glibc >= 2.17, kernel >= 3.10, helm > 3.0. The NVIDIA Container Toolkit runtime must be configured as the default container runtime on GPU nodes, and each GPU node must be labeled gpu=on.
  • Resources: Official documentation and examples are available.
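The version minimums above can be checked mechanically before installing. The following is a rough sketch, not part of the project: the helper names and the component keys are hypothetical, and it only handles plain dotted numeric versions (a kernel string such as "5.15.0-91-generic" would need to be trimmed first).

```python
import operator

# Minimums copied from the prerequisites list: name -> (comparison, version).
REQUIREMENTS = {
    "nvidia-driver": (operator.ge, "384.81"),
    "nvidia-docker": (operator.gt, "2.0"),
    "kubernetes": (operator.ge, "1.16"),
    "glibc": (operator.ge, "2.17"),
    "kernel": (operator.ge, "3.10"),
    "helm": (operator.gt, "3.0"),
}


def parse_version(text):
    """Turn a dotted numeric string such as '1.16' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, op, required):
    """Compare two versions after padding them to the same length with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return op(a, b)


def unmet_prerequisites(installed):
    """Return the names of components that are missing or too old."""
    problems = []
    for name, (op, required) in REQUIREMENTS.items():
        version = installed.get(name)
        if version is None or not satisfies(version, op, required):
            problems.append(name)
    return problems
```

For example, `unmet_prerequisites({"kubernetes": "1.28.3"})` would report every component except Kubernetes, since the other versions were not supplied.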

Highlighted Details

  • Enables GPU sharing, allowing multiple tasks to utilize portions of a single GPU.
  • Supports GPU memory allocation by absolute size or percentage.
  • Features "Virtual Device Memory" to oversubscribe GPU memory using host RAM.
  • Allows specifying GPU types to use or avoid via annotations.
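The GPU type selection mentioned in the last item can be illustrated with a small helper that produces Pod annotations. This is a sketch under assumptions: the annotation keys `nvidia.com/use-gputype` and `nvidia.com/nouse-gputype` and the comma-separated value format are taken from the project's README as best recalled and should be verified; the function name is hypothetical.

```python
def gpu_type_annotations(use=None, avoid=None):
    """Build Pod annotations that restrict which GPU types a task may land on.

    Annotation keys are assumed from the project's README, not verified.
    """
    annotations = {}
    if use:
        # Only schedule onto GPUs whose type matches one of these names.
        annotations["nvidia.com/use-gputype"] = ",".join(use)
    if avoid:
        # Never schedule onto GPUs whose type matches one of these names.
        annotations["nvidia.com/nouse-gputype"] = ",".join(avoid)
    return annotations
```

The returned dict would go under `metadata.annotations` in the Pod manifest, for example `gpu_type_annotations(use=["A100", "V100"])`.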

Maintenance & Community

The project has been renamed to project-HAMi, but the old repository is still maintained for compatibility. Contact information for the owner/maintainer is provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Currently, A100 MIG supports only "none" and "mixed" modes. Tasks specifying nodeName are not supported; nodeSelector should be used instead. Only computing tasks are supported; video codec processing is not yet implemented.
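Since nodeName is not supported, one way to pin a task to a particular node is to convert the field into a nodeSelector. The sketch below does this on a manifest dict using the well-known Kubernetes label `kubernetes.io/hostname`; it assumes that label's value matches the node name, which is typical but not guaranteed, and the function name is hypothetical.

```python
import copy


def replace_node_name(pod):
    """Return a copy of a Pod manifest with spec.nodeName turned into a nodeSelector.

    Assumes the node's kubernetes.io/hostname label equals its node name.
    """
    result = copy.deepcopy(pod)  # leave the caller's manifest untouched
    spec = result["spec"]
    node_name = spec.pop("nodeName", None)
    if node_name is not None:
        selector = spec.setdefault("nodeSelector", {})
        selector["kubernetes.io/hostname"] = node_name
    return result
```

Going through nodeSelector keeps the pod on the normal scheduling path, whereas nodeName binds the pod directly to a node and bypasses the scheduler.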

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

10 stars in the last 90 days
