dynolog by facebookincubator

Telemetry daemon for performance monitoring and tracing of heterogeneous CPU-GPU systems

Created 3 years ago

359 stars

Top 78.1% on SourcePulse


1 Expert Loves This Project

Zhiqiang Xie

Coauthor of SGLang

Project Summary

Dynolog is a telemetry daemon for performance monitoring and tracing across heterogeneous CPU-GPU systems, aimed primarily at large-scale AI training workloads. It collects metrics from the Linux kernel, CPUs, and NVIDIA GPUs (via DCGM), and it integrates with PyTorch for on-demand distributed tracing. Together these give a unified view of system performance that helps locate bottlenecks in complex AI environments.

How It Works

Dynolog runs as a daemon that continuously collects system-level metrics and can be triggered remotely for deep-dive profiling. It uses Linux perf_event for CPU micro-architectural counters and NVIDIA's DCGM for GPU metrics, and it connects to the PyTorch profiler through an IPC monitor. This combination supports both always-on monitoring and fine-grained, application-specific tracing, so hardware events can be correlated with application behavior.
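To make the idea of always-on kernel metric collection concrete, here is a minimal sketch of interval sampling from /proc/stat. It is an illustration only and not Dynolog's own code (Dynolog is built with C++ and Rust); the function names parse_cpu_line, cpu_utilization, and sample_proc_stat are hypothetical.

```python
import time
from typing import Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (busy_jiffies, total_jiffies) from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            # Guest time is already counted in user/nice, so sum only 8 fields.
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before: str, after: str) -> float:
    """Fraction of non-idle CPU time between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return (busy1 - busy0) / elapsed


def sample_proc_stat(interval_s: float = 1.0) -> float:
    """Take two snapshots interval_s apart (Linux only) and return utilization."""
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        second = f.read()
    return cpu_utilization(first, second)
```

A monitoring daemon would typically run this kind of delta computation in a loop and ship each result to a logging backend. Counters are cumulative, so only the difference between two snapshots is meaningful.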

Quick Start & Requirements

Installation: Install via RPM (CentOS) or DEB (Ubuntu) packages. Users without sudo can extract packages to run in userspace.
Prerequisites: NVIDIA DCGM is required for GPU monitoring. C++17 (GCC 8.5.0+) and Rust (1.56+) toolchains are needed for building from source.
Setup: Pre-built packages offer a quick setup. Building from source requires installing cmake, ninja, and cargo/rustup.
Documentation: PyTorch Profiler Integration, Metrics Documentation.

Highlighted Details

On-demand remote tracing for PyTorch distributed training (v1.13.0+).
NVIDIA GPU monitoring via DCGM (Kepler/Volta onwards).
CPU performance monitoring using Linux perf_event (cache, TLB, etc.).
Support for Intel Processor Trace and memory latency/bandwidth monitoring is in development.
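The on-demand PyTorch tracing workflow can be sketched roughly as follows. This assumes the KINETO_USE_DAEMON environment variable and the dyno gputrace subcommand with a --log-file option, as described in the project's PyTorch Profiler Integration documentation; verify both against the version you install. The helper names training_env and gputrace_command are hypothetical.

```python
import os
from typing import Dict, List, Optional


def training_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and ask the PyTorch profiler to register with the daemon.

    Assumption: KINETO_USE_DAEMON=1 is the switch the profiler checks.
    """
    env = dict(os.environ if base is None else base)
    env["KINETO_USE_DAEMON"] = "1"
    return env


def gputrace_command(log_file: str) -> List[str]:
    """Build the CLI invocation that requests a trace from running jobs.

    Assumption: the client binary is named 'dyno' and accepts --log-file.
    """
    return ["dyno", "gputrace", "--log-file", log_file]
```

In use, the training script would be launched with something like subprocess.Popen(["python", "train.py"], env=training_env()), where train.py is a placeholder, and the trace would be requested later with subprocess.run(gputrace_command("/tmp/trace.json")). The daemon must already be running with its IPC monitor enabled for the request to reach the job.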

Maintenance & Community

Actively maintained by Meta engineers. Community interaction and bug reporting via GitHub Issues.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently supports only Linux platforms and NVIDIA GPUs. Intel Processor Trace and memory monitoring are under active development. Some userspace features may have limitations without root access.

Health Check

Last Commit

1 week ago

Responsiveness

1+ week

Star History

3 stars in the last 30 days