GPU monitoring/diagnostics tool for AI/ML workloads
GPUd is an open-source tool designed to enhance GPU reliability and efficiency in AI/ML workloads by automating monitoring, diagnostics, and issue identification. It targets users operating large-scale GPU clusters, aiming to minimize downtime and maintain high performance by proactively addressing common GPU hardware failures.
How It Works
GPUd operates as a self-contained binary, providing a GPU-centric view of critical metrics and issues. It integrates seamlessly with systems like Docker, containerd, Kubernetes, and the NVIDIA ecosystem. The tool parses nvidia-smi output, checks for common errors, and monitors system-level metrics, aiming for minimal CPU and memory overhead.
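To make the nvidia-smi parsing step concrete, here is a minimal Python sketch of the general idea. It is not GPUd's own code: the function names, the choice of fields, and the 85 °C threshold are assumptions made for illustration. Only the `--query-gpu` and `--format=csv,noheader,nounits` options are standard nvidia-smi flags.

```python
import csv
import io
import subprocess

# Real nvidia-smi --query-gpu properties; which ones to collect is an illustrative choice.
FIELDS = ["index", "name", "temperature.gpu", "utilization.gpu", "memory.used", "memory.total"]


def query_gpus():
    """Run nvidia-smi and return its raw CSV text (needs an NVIDIA driver installed)."""
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpus(csv_text):
    """Turn nvidia-smi CSV rows into one dict per GPU, keyed by field name."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def hot_gpus(gpus, limit_c=85):
    """Return indices of GPUs at or above a hypothetical temperature limit.

    Non-numeric readings such as "[N/A]" are skipped rather than treated as errors.
    """
    return [
        g["index"]
        for g in gpus
        if g["temperature.gpu"].isdigit() and int(g["temperature.gpu"]) >= limit_c
    ]
```

On a machine with a driver installed, `hot_gpus(parse_gpus(query_gpus()))` would list the indices of GPUs that exceed the limit. Keeping the parser separate from the subprocess call means it can be exercised against captured output on a machine without GPUs.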
Quick Start & Requirements
curl -fsSL https://pkg.gpud.dev/install.sh | sh
gpud run
sudo gpud up --token <LEPTON_AI_TOKEN>
sudo gpud up
Highlighted Details
Checks cover kmsg, hardware slowdowns, NVML Xid events, and DCGM, in addition to nvidia-smi output.
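As an illustration of what scanning kernel messages for Xid events can look like, the sketch below extracts Xid reports from log lines. It assumes the usual NVIDIA driver message shape (`NVRM: Xid (PCI:<address>): <code>, <details>`). The function name and the returned dictionary layout are hypothetical and do not reflect GPUd's internals.

```python
import re

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_events(lines):
    """Return one dict per Xid report found in an iterable of kernel log lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # Not an Xid line; ignore unrelated kernel messages.
        events.append(
            {
                "pci": match.group("pci"),
                "code": int(match.group("code")),
                "detail": (match.group("detail") or "").strip(),
            }
        )
    return events
```

The input can come from any source of kernel messages, for example the lines of captured `dmesg` output. Reading the kernel log typically requires elevated privileges.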
Maintenance & Community
GPUd is developed by Lepton AI, drawing on experience from large-scale GPU cluster operations. Auto-updates are enabled by default when registered with the Lepton platform. Usage statistics are collected anonymously by default but can be disabled.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility with commercial or closed-source linking is not specified.
Limitations & Caveats
The installation script currently only supports Linux on amd64 (x86_64) architectures; macOS and other architectures require manual execution. The project is in active development, with auto-updates enabled by default when connected to the Lepton AI platform.
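Given the platform restriction above, a provisioning script might check the host before invoking the installer. The following is a small sketch under that assumption. The helper name is hypothetical, and the accepted machine strings are the common spellings of the amd64 architecture rather than anything taken from the install script itself.

```python
import platform


def installer_supported(system=None, machine=None):
    """Return True when the host looks like Linux on amd64 (x86_64).

    Arguments default to the current host; pass explicit values to test other platforms.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Linux" and machine.lower() in {"x86_64", "amd64"}
```

A wrapper could call `installer_supported()` and fall back to a manual setup path when it returns `False`.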