gpud by leptonai

GPU monitoring/diagnostics tool for AI/ML workloads

Created 1 year ago

476 stars

Top 64.3% on SourcePulse

2 Experts Love This Project

Luis Capelo

Cofounder of Lightning AI

Johannes Hagemann

Cofounder of Prime Intellect

Project Summary

GPUd is an open-source tool designed to enhance GPU reliability and efficiency in AI/ML workloads by automating monitoring, diagnostics, and issue identification. It targets users operating large-scale GPU clusters, aiming to minimize downtime and maintain high performance by proactively addressing common GPU hardware failures.

How It Works

GPUd runs as a single self-contained binary and presents a GPU-centric view of critical metrics and issues. It integrates with Docker, containerd, Kubernetes, and the NVIDIA ecosystem. The tool parses nvidia-smi output, checks for common errors, and monitors system-level metrics while aiming for minimal CPU and memory overhead.

Quick Start & Requirements

Installation: curl -fsSL https://pkg.gpud.dev/install.sh | sh
Prerequisites: Linux on amd64 (x86_64), the only platform the install script supports; systemd (default). On macOS, run GPUd directly with gpud run.
Running with Lepton Platform: sudo gpud up --token <LEPTON_AI_TOKEN>
Running Standalone: sudo gpud up
Kubernetes: Helm chart available.
Docs: https://docs.gpud.dev/
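
Because the install script is described as supporting only Linux on amd64 (x86_64), a pre-flight check can save a failed install. The sketch below is an illustration, not part of GPUd: the function name install_script_supported is hypothetical, and it only inspects uname output before pointing to the documented commands.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with GPUd).
# Succeeds only for the platform the install script is described as
# supporting: Linux on amd64 / x86_64.
# Arguments are optional; they default to the values reported by uname.
install_script_supported() {
    os="${1:-$(uname -s)}"
    arch="${2:-$(uname -m)}"
    [ "$os" = "Linux" ] && { [ "$arch" = "x86_64" ] || [ "$arch" = "amd64" ]; }
}

if install_script_supported; then
    echo "supported: run the documented installer, then 'sudo gpud up'"
else
    echo "not supported by the install script: $(uname -s)/$(uname -m)"
fi
```

Passing explicit arguments, for example install_script_supported Darwin arm64, makes the check easy to exercise on a machine other than the target host.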

Highlighted Details

Monitors critical GPU and fabric metrics (power, temperature).
Detects errors and hardware slowdowns via kmsg, NVML Xid events, and DCGM.
Reports GPU and fabric status by parsing nvidia-smi.
Monitors overall system metrics (CPU, memory, disk).
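
To make the idea of reading GPU metrics from nvidia-smi concrete, here is a minimal sketch that flags GPUs above a temperature threshold. It is not GPUd's implementation: the function name flag_hot_gpus, the 85 °C threshold, and the sample rows are assumptions for illustration. The input format matches what nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits prints.

```shell
#!/bin/sh
# Hypothetical helper (not part of GPUd): print the index of every GPU
# whose temperature exceeds the threshold given as $1 (degrees Celsius).
# Expects lines of "index, temperature, power" on stdin, the layout of
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw \
#              --format=csv,noheader,nounits
flag_hot_gpus() {
    threshold="$1"
    # Fields are separated by a comma and optional spaces; "+ 0" forces
    # a numeric comparison.
    awk -F', *' -v limit="$threshold" '$2 + 0 > limit + 0 { print $1 }'
}

# Sample rows standing in for real nvidia-smi output (made-up values).
sample='0, 45, 67.89
1, 91, 300.12
2, 88, 250.00'

printf '%s\n' "$sample" | flag_hot_gpus 85
```

On a host with NVIDIA drivers installed, the sample rows can be replaced by piping the nvidia-smi query shown in the comment into flag_hot_gpus.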

Maintenance & Community

GPUd is developed by Lepton AI, drawing on experience from large-scale GPU cluster operations. Auto-updates are enabled by default when registered with the Lepton platform. Usage statistics are collected anonymously by default but can be disabled.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility with commercial or closed-source linking is not specified.

Limitations & Caveats

The install script currently supports only Linux on amd64 (x86_64); macOS and other architectures require running GPUd manually. The project is under active development, and auto-updates are enabled by default when connected to the Lepton AI platform.

Health Check

Last Commit: 1 day ago
Responsiveness: 1 day

Star History

6 stars in the last 30 days