Deep learning benchmark for hardware performance on core operations
DeepBench benchmarks fundamental deep learning operations (GEMM, convolutions, recurrent layers, and all-reduce) across hardware platforms. It targets hardware vendors and researchers who want to understand performance bottlenecks in deep learning training and inference, providing benchmarks of low-level operations rather than of full models.
How It Works
DeepBench defines specific operation sizes and precision requirements for both training and inference. It uses vendor-supplied libraries (e.g., cuDNN, MKL) so that results reflect what a typical user would observe. The benchmark measures execution time and achieved FLOPS for dense matrix multiplies (GEMM), convolutions (NCHW format), recurrent cells (vanilla RNN, LSTM, GRU), and All-Reduce communication patterns.
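To make the measurement concrete, here is a minimal sketch of how such a timing loop can work; it is not DeepBench's actual harness. It times a single-precision GEMM through cuBLAS using CUDA events and converts the average time into TFLOPS. The matrix dimensions are illustrative assumptions, not a prescribed DeepBench kernel.

```cpp
// Minimal GEMM timing sketch (not DeepBench's harness).
// Compile with: nvcc gemm_time.cu -lcublas
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Illustrative problem size; real benchmarks sweep a list of sizes.
    const int m = 5124, n = 700, k = 2048;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time library initialization is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int reps = 100;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double sec = (ms / reps) / 1e3;
    // Each GEMM performs 2*m*n*k floating-point operations.
    double tflops = 2.0 * m * n * k / sec / 1e12;
    printf("avg time: %.3f ms, %.2f TFLOPS\n", ms / reps, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```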
Quick Start & Requirements
- Build with `make` commands, supplying paths for CUDA, cuDNN, MPI, and NCCL.
- Set `ARCH` for NVIDIA GPUs (e.g., `ARCH=sm_61`).
- Compiled binaries are placed in the `bin/` directory. Usage examples: `bin/gemm_bench <inference|train> <int8|float|half>`, `bin/nccl_single_all_reduce <num_gpus>` (example invocations follow this list).
- Results are collected in the `results/` folder, with library specifics in Excel sheets.
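As a concrete illustration, a build-and-run session might look like the sketch below; the `make` variable names and install paths are assumptions and should be checked against the repository's Makefile before use.

```sh
# Assumed make variables and paths; verify against the project's Makefile.
make CUDA_PATH=/usr/local/cuda CUDNN_PATH=/usr/local/cudnn \
     MPI_PATH=/usr/lib/openmpi NCCL_PATH=/usr/local/nccl ARCH=sm_61

# GEMM benchmark in training mode at float precision.
bin/gemm_bench train float

# Single-node NCCL all-reduce across 4 GPUs.
bin/nccl_single_all_reduce 4
```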
Highlighted Details
Maintenance & Community
Last activity was 4 years ago; the project is inactive.
Licensing & Compatibility
Limitations & Caveats