dynolog by facebookincubator

Telemetry daemon for performance monitoring and tracing of heterogeneous CPU-GPU systems

Created 3 years ago
338 stars

Top 81.5% on SourcePulse

Project Summary

Dynolog is a telemetry daemon for performance monitoring and tracing on heterogeneous CPU-GPU systems, aimed primarily at large-scale AI training workloads. It collects metrics from the Linux kernel, CPUs, and NVIDIA GPUs (via DCGM) into a unified view of system performance, and it integrates with PyTorch for on-demand distributed tracing, which simplifies finding bottlenecks in complex AI environments.

How It Works

Dynolog runs as a daemon that continuously collects system-level metrics and can be triggered remotely for deep-dive profiling. It uses Linux perf_event for CPU micro-architectural counters and NVIDIA's DCGM for GPU metrics, and it connects to the PyTorch profiler through an IPC monitor. This design supports both always-on monitoring and granular, application-specific tracing, so hardware events can be correlated with application behavior.
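To make the "always-on" part concrete, here is a minimal Python sketch of the kind of kernel-level sampling a monitoring daemon performs: it reads the aggregate `cpu` line from `/proc/stat` twice and derives CPU utilization from the tick deltas. This is an illustration only, not Dynolog's implementation (Dynolog itself is built with C++ and Rust); the function names `parse_cpu_line` and `cpu_utilization` are hypothetical.

```python
import time
from typing import Dict

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(proc_stat_text: str) -> Dict[str, int]:
    """Return the aggregate CPU tick counters from the contents of /proc/stat."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1 : 1 + len(CPU_FIELDS)]]
            return dict(zip(CPU_FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(prev: Dict[str, int], curr: Dict[str, int]) -> float:
    """Fraction of non-idle ticks between two samples, from 0.0 to 1.0."""
    # Older kernels may report fewer fields, so missing counters default to 0.
    delta = {k: curr.get(k, 0) - prev.get(k, 0) for k in CPU_FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return 0.0
    idle = delta["idle"] + delta["iowait"]
    return (total - idle) / total


if __name__ == "__main__":
    # Linux only: take two samples one second apart and report utilization.
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.read())
    time.sleep(1.0)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.read())
    print(f"CPU utilization: {cpu_utilization(first, second):.1%}")
```

A real daemon would run this kind of sampling on a fixed interval and forward the results to a logging backend; micro-architectural counters need perf_event rather than `/proc`.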

Quick Start & Requirements

  • Installation: Install via RPM (CentOS) or DEB (Ubuntu) packages. Users without sudo access can extract the packages and run the binaries from userspace.
  • Prerequisites: NVIDIA DCGM is required for GPU monitoring. C++17 (GCC 8.5.0+) and Rust (1.56+) toolchains are needed for building from source.
  • Setup: Pre-built packages offer a quick setup. Building from source requires installing cmake, ninja, and cargo/rustup.
  • Documentation: PyTorch Profiler Integration, Metrics Documentation.
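Before building from source, it can help to confirm that the local toolchain meets the minimum versions listed above. The following Python sketch checks GCC and Rust against those minimums. It is a convenience script written for this summary, not part of Dynolog; the helper names (`parse_version`, `meets_minimum`, `tool_version`) are hypothetical, and it assumes `gcc -dumpfullversion` and `rustc --version` are available on the PATH.

```python
import re
import subprocess
from typing import List, Optional, Tuple

Version = Tuple[int, ...]

# Minimum versions quoted in the prerequisites above.
MINIMUMS = {"gcc": (8, 5, 0), "rustc": (1, 56, 0)}


def parse_version(text: str) -> Optional[Version]:
    """Extract the first dotted version number found in a tool's output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found: Version, minimum: Version) -> bool:
    """Compare versions component-wise, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    padded_found = found + (0,) * (width - len(found))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum


def tool_version(command: List[str]) -> Optional[Version]:
    """Run a version command; return None if the tool is missing or fails."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    commands = {"gcc": ["gcc", "-dumpfullversion"], "rustc": ["rustc", "--version"]}
    for name, command in commands.items():
        found = tool_version(command)
        required = ".".join(map(str, MINIMUMS[name]))
        if found is None:
            print(f"{name}: not found (need {required}+)")
        else:
            status = "ok" if meets_minimum(found, MINIMUMS[name]) else "too old"
            print(f"{name}: {'.'.join(map(str, found))} ({status}, need {required}+)")
```

The script only reports versions. Installing cmake, ninja, and cargo/rustup is still a separate step.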

Highlighted Details

  • On-demand remote tracing for PyTorch distributed training (v1.13.0+).
  • NVIDIA GPU monitoring via DCGM (Kepler/Volta onwards).
  • CPU performance monitoring using Linux perf_event (cache, TLB, etc.).
  • Support for Intel Processor Trace and memory latency/bandwidth monitoring is in development.

Maintenance & Community

Actively maintained by Meta engineers. Community interaction and bug reports go through GitHub Issues.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

Currently supports only Linux platforms and NVIDIA GPUs. Intel Processor Trace and memory monitoring are under active development. Some features may be limited when running from userspace without root access.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 13
  • Issues (30d): 0
  • Star history: 11 stars in the last 30 days
