gpud by leptonai

GPU monitoring/diagnostics tool for AI/ML workloads

Created 1 year ago
429 stars

Top 69.1% on SourcePulse

Project Summary

GPUd is an open-source tool designed to enhance GPU reliability and efficiency in AI/ML workloads by automating monitoring, diagnostics, and issue identification. It targets users operating large-scale GPU clusters, aiming to minimize downtime and maintain high performance by proactively addressing common GPU hardware failures.

How It Works

GPUd runs as a self-contained binary and provides a GPU-centric view of critical metrics and issues. It integrates with Docker, containerd, Kubernetes, and the NVIDIA ecosystem. The tool parses nvidia-smi output, checks for common errors, and monitors system-level metrics while aiming for minimal CPU and memory overhead.
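To make the "parses nvidia-smi output" step concrete, here is a minimal sketch of reading per-GPU temperature and power from nvidia-smi's CSV query mode. It is not GPUd's implementation; the function names (parse_gpu_csv, query_gpus) and the choice of fields are assumptions made for illustration, and query_gpus only works on a host where nvidia-smi is on the PATH.

```python
import csv
import io
import subprocess
from typing import Optional

# Properties requested via --query-gpu; chosen here for illustration only.
FIELDS = ["index", "name", "temperature.gpu", "power.draw"]


def _to_float(value: str) -> Optional[float]:
    """Convert a numeric field, returning None for placeholders such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list:
    """Parse output produced with --format=csv,noheader,nounits into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        index, name, temp, power = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "name": name,
            "temperature_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return rows


def query_gpus() -> list:
    """Run nvidia-smi and parse its output (hypothetical helper, not part of GPUd)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Keeping the parser separate from the subprocess call means it can be exercised with captured text on machines that have no GPU.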

Quick Start & Requirements

  • Installation: curl -fsSL https://pkg.gpud.dev/install.sh | sh
  • Prerequisites: Linux (the install script supports the amd64/x86_64 architecture) and systemd (default). macOS is supported via gpud run.
  • Running with Lepton Platform: sudo gpud up --token <LEPTON_AI_TOKEN>
  • Running Standalone: sudo gpud up
  • Kubernetes: Helm chart available.
  • Docs: https://docs.gpud.dev/
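The platform constraints in the list above can be checked before choosing an install path. The sketch below is an illustration only: install_script_supported and suggest_command are hypothetical helper names, and the macOS branch assumes the gpud binary has already been obtained by some other means, which this summary does not describe.

```python
import platform

# Per the prerequisites above, the install script targets Linux on amd64/x86_64.
SUPPORTED_MACHINES = {"x86_64", "amd64"}


def install_script_supported(system: str, machine: str) -> bool:
    """Return True when the one-line install script is expected to apply."""
    return system == "Linux" and machine.lower() in SUPPORTED_MACHINES


def suggest_command(system: str, machine: str) -> str:
    """Map a platform to the command named in the quick-start list."""
    if install_script_supported(system, machine):
        return "curl -fsSL https://pkg.gpud.dev/install.sh | sh"
    if system == "Darwin":
        # Assumes the gpud binary is already installed manually.
        return "gpud run"
    return "see https://docs.gpud.dev/"


if __name__ == "__main__":
    print(suggest_command(platform.system(), platform.machine()))
```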

Highlighted Details

  • Monitors critical GPU and fabric metrics (power, temperature).
  • Detects errors via kmsg, hardware slowdowns, NVML Xid events, and DCGM.
  • Reports GPU and fabric status by parsing nvidia-smi.
  • Monitors overall system metrics (CPU, memory, disk).
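As an illustration of error detection from kernel messages, the sketch below extracts Xid events from log lines. It is not GPUd's detection logic: the regular expression assumes the commonly seen "NVRM: Xid (PCI:<address>): <code>, <detail>" layout, which can vary across driver versions, and the names XidEvent, parse_xid, and scan are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class XidEvent(NamedTuple):
    pci_address: str  # bus address as printed in the log line
    code: int         # numeric Xid code
    detail: str       # remainder of the message, if any


# Assumed layout: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., <message>".
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9A-Fa-f:.]+)\): "
    r"(?P<code>\d+)(?:,\s*(?P<detail>.*))?"
)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an Xid report, else None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    detail = (match["detail"] or "").strip()
    return XidEvent(match["pci"], int(match["code"]), detail)


def scan(lines: Iterable[str]) -> List[XidEvent]:
    """Collect every Xid event found in an iterable of log lines."""
    return [event for event in map(parse_xid, lines) if event is not None]
```

In practice the lines could come from /dev/kmsg or from dmesg output; the sample lines used to exercise this parser should be treated as illustrative rather than as captured logs.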

Maintenance & Community

GPUd is developed by Lepton AI, drawing on experience from large-scale GPU cluster operations. Auto-updates are enabled by default when registered with the Lepton platform. Usage statistics are collected anonymously by default but can be disabled.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility with commercial or closed-source linking is not specified.

Limitations & Caveats

The installation script currently only supports Linux on amd64 (x86_64) architectures; macOS and other architectures require manual execution. The project is in active development, with auto-updates enabled by default when connected to the Lepton AI platform.

Health Check

  • Last Commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 39
  • Issues (30d): 3

Star History

19 stars in the last 30 days
