dynolog by facebookincubator

Telemetry daemon for performance monitoring and tracing of heterogeneous CPU-GPU systems

created 3 years ago
326 stars

Top 84.8% on sourcepulse

Project Summary

Dynolog is a telemetry daemon for performance monitoring and tracing across heterogeneous CPU-GPU systems, primarily targeting large-scale AI training workloads. It provides a unified view of system performance by collecting metrics from the Linux kernel, CPUs, and GPUs (NVIDIA via DCGM), and it integrates with PyTorch for on-demand distributed tracing, which simplifies bottleneck identification in complex AI environments.

How It Works

Dynolog operates as a daemon that continuously collects system-level metrics and can be remotely triggered for deep-dive profiling. It leverages Linux perf_event for CPU micro-architectural counters, NVIDIA's DCGM for GPU metrics, and integrates with the PyTorch profiler via an IPC monitor. This approach allows for both always-on monitoring and granular, application-specific tracing, offering a holistic performance picture by correlating hardware events with application behavior.
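
To make the always-on collection idea concrete, here is a minimal sketch of how a monitor can turn two kernel counter snapshots into a utilization figure. It is not Dynolog's code (Dynolog is built with C++ and Rust toolchains); it only illustrates the general delta-between-samples technique using the format of the aggregate `cpu` line in `/proc/stat`. The sample values and the names `CpuTimes`, `parse_cpu_line`, and `cpu_utilization` are assumptions made up for this illustration.

```python
from dataclasses import dataclass

# Hypothetical aggregate "cpu" lines in /proc/stat format, one interval apart.
# Field order: user nice system idle iowait irq softirq steal guest guest_nice
SAMPLE_T0 = "cpu  1000 0 500 8000 100 0 50 0 0 0"
SAMPLE_T1 = "cpu  1300 0 600 8900 150 0 50 0 0 0"


@dataclass(frozen=True)
class CpuTimes:
    busy: int  # ticks spent doing work
    idle: int  # ticks spent idle or waiting on I/O


def parse_cpu_line(line: str) -> CpuTimes:
    """Parse the aggregate 'cpu' line of /proc/stat into busy and idle ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected a line starting with 'cpu'")
    # Older kernels report fewer columns, so pad to the eight used here.
    values = [int(v) for v in fields[1:9]]
    values += [0] * (8 - len(values))
    user, nice, system, idle, iowait, irq, softirq, steal = values
    # Guest time is already counted inside user time, so it is not added again.
    return CpuTimes(
        busy=user + nice + system + irq + softirq + steal,
        idle=idle + iowait,
    )


def cpu_utilization(prev: CpuTimes, curr: CpuTimes) -> float:
    """Return the busy fraction (0.0 to 1.0) between two snapshots."""
    busy = curr.busy - prev.busy
    total = busy + (curr.idle - prev.idle)
    return busy / total if total > 0 else 0.0


if __name__ == "__main__":
    util = cpu_utilization(parse_cpu_line(SAMPLE_T0), parse_cpu_line(SAMPLE_T1))
    print(f"CPU utilization over the interval: {util:.1%}")
```

On a live Linux host, the two sample strings could be replaced by the first line of `/proc/stat` read at the start and end of a sampling interval. The counters are cumulative since boot, which is why the sketch works on differences between snapshots and not on raw values.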

Quick Start & Requirements

  • Installation: Install via RPM (CentOS) or DEB (Ubuntu) packages. Users without sudo can extract packages to run in userspace.
  • Prerequisites: NVIDIA DCGM is required for GPU monitoring. C++17 (GCC 8.5.0+) and Rust (1.56+) toolchains are needed for building from source.
  • Setup: Pre-built packages offer a quick setup. Building from source requires installing cmake, ninja, and cargo/rustup.
  • Documentation: PyTorch Profiler Integration, Metrics Documentation.

Highlighted Details

  • On-demand remote tracing for PyTorch distributed training (v1.13.0+).
  • NVIDIA GPU monitoring via DCGM (Kepler/Volta onwards).
  • CPU performance monitoring using Linux perf_event (cache, TLB, etc.).
  • Support for Intel Processor Trace and memory latency/bandwidth monitoring is in development.

Maintenance & Community

Actively maintained by Meta engineers. Community interaction and bug reporting via GitHub Issues.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently supports only Linux platforms and NVIDIA GPUs. Intel Processor Trace and memory monitoring are under active development. Some userspace features may have limitations without root access.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 22
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
