Telemetry daemon for performance monitoring and tracing of heterogeneous CPU-GPU systems
Dynolog is a telemetry daemon for comprehensive performance monitoring and tracing across heterogeneous CPU-GPU systems, aimed primarily at large-scale AI training workloads. It collects metrics from the Linux kernel, CPUs, and GPUs (NVIDIA via DCGM) and integrates with PyTorch for on-demand distributed tracing. Together these give a unified view of system performance that simplifies bottleneck identification in complex AI environments.
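As a rough illustration of kernel-level metric collection, the sketch below computes aggregate CPU utilization from two snapshots of Linux's /proc/stat. It is a minimal sketch, not Dynolog's implementation. The names `CpuTimes`, `parse_proc_stat`, `cpu_utilization`, and `sample_cpu_utilization` are hypothetical, and the code assumes the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```python
import time
from dataclasses import dataclass

# Number of counters after the "cpu" label on modern kernels; older kernels
# report fewer, and this sketch treats the missing trailing fields as zero.
_FIELDS = 10


@dataclass(frozen=True)
class CpuTimes:
    """Cumulative busy and idle time, in USER_HZ ticks, for all CPUs combined."""
    busy: int
    idle: int


def parse_proc_stat(text: str) -> CpuTimes:
    """Parse the aggregate 'cpu' line (the first line) of /proc/stat."""
    fields = text.splitlines()[0].split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line first")
    values = [int(v) for v in fields[1:]]
    values += [0] * (_FIELDS - len(values))
    # idle + iowait count as idle time.
    idle = values[3] + values[4]
    # user, nice, system, irq, softirq and steal count as busy time; guest and
    # guest_nice are already included in user and nice, so they are skipped.
    busy = sum(values[:8]) - idle
    return CpuTimes(busy=busy, idle=idle)


def cpu_utilization(prev: CpuTimes, curr: CpuTimes) -> float:
    """Fraction of CPU time spent busy between two snapshots (0.0 to 1.0)."""
    d_busy = curr.busy - prev.busy
    d_idle = curr.idle - prev.idle
    total = d_busy + d_idle
    return d_busy / total if total > 0 else 0.0


def sample_cpu_utilization(interval_s: float = 1.0) -> float:
    """Read /proc/stat twice, interval_s seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        prev = parse_proc_stat(f.read())
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        curr = parse_proc_stat(f.read())
    return cpu_utilization(prev, curr)
```

Utilization is a ratio of counter deltas, so a continuously running collector has to keep the previous snapshot between sampling intervals. A single read of /proc/stat only yields cumulative totals since boot.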
How It Works
Dynolog runs as a daemon that continuously collects system-level metrics and can be triggered remotely for deep-dive profiling. It uses Linux perf_event for CPU micro-architectural counters and NVIDIA's DCGM for GPU metrics, and it integrates with the PyTorch profiler through an IPC monitor. This combination supports both always-on monitoring and fine-grained, application-specific tracing, giving a holistic performance picture by correlating hardware events with application behavior.
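The split between always-on monitoring and on-demand tracing can be pictured with the simplified model below. It is a hypothetical sketch, not Dynolog's IPC protocol or the PyTorch profiler API: `OnDemandTracer`, `request`, `should_trace`, and `run_loop` are illustrative names, and the daemon and the training process are collapsed into a single loop. Metrics are emitted on every step, while tracing is switched on only for the number of iterations named in a remote request.

```python
from typing import Callable, Dict, List


class OnDemandTracer:
    """Hypothetical iteration-based trace trigger.

    A remote request (delivered over IPC in a real system) asks the application
    to trace its next N iterations; the training loop polls once per step.
    """

    def __init__(self) -> None:
        self._remaining = 0

    def request(self, iterations: int) -> None:
        # A new request replaces any request that is still pending.
        if iterations <= 0:
            raise ValueError("iterations must be positive")
        self._remaining = iterations

    def should_trace(self) -> bool:
        # Consume one iteration of the pending request, if any.
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


def run_loop(
    steps: int,
    tracer: OnDemandTracer,
    requests: Dict[int, int],
    on_metrics: Callable[[int], None],
) -> List[int]:
    """Simulate a training loop and return the steps that were traced.

    `requests` maps a step number to the iteration count of a trace request
    that arrives just before that step.
    """
    traced: List[int] = []
    for step in range(steps):
        if step in requests:
            tracer.request(requests[step])
        on_metrics(step)  # always-on path: runs on every step
        if tracer.should_trace():  # on-demand path: only while a request is active
            traced.append(step)
    return traced
```

For example, with `seen = []`, the call `run_loop(6, OnDemandTracer(), {2: 2}, seen.append)` invokes the metrics callback for steps 0 through 5 but traces only steps 2 and 3. Polling once per iteration keeps each trace window aligned to iteration boundaries, which makes traces from different ranks easier to compare.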
Quick Start & Requirements
Building from source requires cmake, ninja, and cargo/rustup.
Highlighted Details
Collects CPU micro-architectural counters via perf_event (cache, TLB, etc.).
Maintenance & Community
Actively maintained by Meta engineers. Community interaction and bug reporting via GitHub Issues.
Licensing & Compatibility
MIT License. Permissive for commercial use and integration with closed-source applications.
Limitations & Caveats
Currently supports only Linux platforms and NVIDIA GPUs. Intel Processor Trace and memory monitoring are under active development. Some userspace features may have limitations without root access.