gpud by leptonai

GPU monitoring/diagnostics tool for AI/ML workloads

created 11 months ago
403 stars

Top 73.0% on sourcepulse

Project Summary

GPUd is an open-source tool designed to enhance GPU reliability and efficiency in AI/ML workloads by automating monitoring, diagnostics, and issue identification. It targets users operating large-scale GPU clusters, aiming to minimize downtime and maintain high performance by proactively addressing common GPU hardware failures.

How It Works

GPUd runs as a self-contained binary and provides a GPU-centric view of critical metrics and issues. It works alongside Docker, containerd, Kubernetes, and the NVIDIA ecosystem. The tool parses nvidia-smi output, checks for common errors, and monitors system-level metrics, while aiming to keep its own CPU and memory overhead minimal.
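As an illustration of the "parses nvidia-smi output" step, here is a minimal Python sketch. It is not GPUd's actual implementation (GPUd ships as its own binary); it only shows one way to read nvidia-smi's CSV query output. The names GpuSample, parse_query_output, and query_gpus are hypothetical, and the choice of the four queried fields is an assumption made for the example.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import List, Optional

# Fields passed to `nvidia-smi --query-gpu`; the order must match GpuSample.
QUERY_FIELDS = "index,name,temperature.gpu,power.draw"


@dataclass
class GpuSample:
    """One row of nvidia-smi query output (hypothetical container type)."""
    index: int
    name: str
    temperature_c: Optional[float]  # None if the value is not numeric, e.g. "[N/A]"
    power_w: Optional[float]


def _to_float(value: str) -> Optional[float]:
    """Convert a field to float, returning None for non-numeric placeholders."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_query_output(text: str) -> List[GpuSample]:
    """Parse CSV text produced with --format=csv,noheader,nounits."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, temp, power = (field.strip() for field in row)
        samples.append(GpuSample(int(index), name, _to_float(temp), _to_float(power)))
    return samples


def query_gpus() -> List[GpuSample]:
    """Run nvidia-smi and parse its output; requires an NVIDIA driver install."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_query_output(result.stdout)
```

On a machine with the NVIDIA driver installed, query_gpus() returns one GpuSample per GPU; parse_query_output can be exercised on captured text without any GPU present.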

Quick Start & Requirements

  • Installation: curl -fsSL https://pkg.gpud.dev/install.sh | sh
  • Prerequisites: Linux with systemd (the default setup); the install script supports the amd64/x86_64 architecture. macOS is supported via gpud run.
  • Running with Lepton Platform: sudo gpud up --token <LEPTON_AI_TOKEN>
  • Running Standalone: sudo gpud up
  • Kubernetes: Helm chart available.
  • Docs: https://docs.gpud.dev/

Highlighted Details

  • Monitors critical GPU and fabric metrics (power, temperature).
  • Detects errors via kmsg, hardware slowdowns, NVML Xid events, and DCGM.
  • Reports GPU and fabric status by parsing nvidia-smi.
  • Monitors overall system metrics (CPU, memory, disk).
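
To make the Xid detection item concrete, the sketch below extracts Xid events from kernel log lines. It assumes the NVIDIA driver's usual message shape, "NVRM: Xid (PCI:<address>): <code>, ...", and is only an illustration, not GPUd's own detection logic. The names XidEvent and parse_xid are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is optional because older drivers omit it.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<device>[0-9A-Fa-f:.]+)\): (?P<code>\d+)"
)


class XidEvent(NamedTuple):
    """PCI address and numeric Xid code taken from one log line."""
    device: str
    code: int


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a log line, or None if there is none."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(match.group("device"), int(match.group("code")))
```

Feeding this function lines from the kernel log (for example, the output of dmesg) yields the affected device address and Xid code, which can then be looked up in NVIDIA's Xid documentation.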

Maintenance & Community

GPUd is developed by Lepton AI, drawing on experience from large-scale GPU cluster operations. Auto-updates are enabled by default when registered with the Lepton platform. Usage statistics are collected anonymously by default but can be disabled.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility with commercial or closed-source linking is not specified.

Limitations & Caveats

The installation script currently only supports Linux on amd64 (x86_64) architectures; macOS and other architectures require manual execution. The project is in active development, with auto-updates enabled by default when connected to the Lepton AI platform.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 65
  • Issues (30d): 1
  • Star history: 57 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.

gpustack by gpustack

1.5%
3k
GPU cluster manager for AI model deployment
created 1 year ago
updated 3 days ago