DeepBench by baidu-research

Deep learning benchmark for hardware performance on core operations

Created 9 years ago
1,095 stars

Top 34.8% on SourcePulse

Project Summary

DeepBench benchmarks fundamental deep learning operations (GEMM, convolutions, recurrent layers, and all-reduce) across a range of hardware platforms. It targets hardware vendors and researchers who want to understand performance bottlenecks in deep learning training and inference, and it reports low-level operation benchmarks rather than full-model performance.

How It Works

DeepBench defines specific operation sizes and precision requirements for both training and inference. It uses vendor-supplied libraries (e.g., cuDNN, MKL) so that results reflect what a typical user of those libraries would see. The benchmark measures execution time and achieved FLOPS for operations such as dense matrix multiplies, convolutions (NCHW format), recurrent cells (vanilla RNN, LSTM, GRU), and All-Reduce communication patterns.
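To make the methodology concrete, here is a minimal sketch of how a DeepBench-style GEMM measurement works: time a cuBLAS SGEMM over repeated calls with CUDA events, then convert the average time into achieved TFLOPS using the 2*M*N*K operation count for a dense matrix multiply. This is not DeepBench's actual harness; the matrix shape and repeat count below are illustrative assumptions.

    // Minimal DeepBench-style GEMM timing sketch (C++/CUDA, cuBLAS).
    // Shape and repeat count are illustrative; error checking is omitted.
    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int M = 1760, N = 128, K = 1760;  // illustrative problem size
        const int repeats = 100;

        float *A, *B, *C;
        cudaMalloc(&A, sizeof(float) * M * K);
        cudaMalloc(&B, sizeof(float) * K * N);
        cudaMalloc(&C, sizeof(float) * M * N);
        cudaMemset(A, 0, sizeof(float) * M * K);  // avoid timing on garbage bits
        cudaMemset(B, 0, sizeof(float) * K * N);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;

        // Warm-up call so library initialization is not timed.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                    &alpha, A, M, B, K, &beta, C, M);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < repeats; ++i)
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                        &alpha, A, M, B, K, &beta, C, M);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double sec = (ms / 1e3) / repeats;
        // A dense GEMM performs 2*M*N*K floating-point operations.
        double tflops = 2.0 * M * N * K / sec / 1e12;
        printf("avg time: %.1f us, %.2f TFLOPS\n", sec * 1e6, tflops);

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }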

Quick Start & Requirements

  • Installation: Clone the repository and use make commands with specified paths for CUDA, cuDNN, MPI, and NCCL.
  • Prerequisites: CUDA Toolkit (tested with 7.5.18), cuDNN (tested with 5.0), OpenMPI (tested with 1.10.2), and NCCL (tested at a specific commit). For AMD hardware: ROCm, MIOpen, and rocBLAS; for ARM: the ARM Compute Library and Eigen.
  • Compilation: Requires specifying the library paths above and, for NVIDIA GPUs, the target architecture (e.g., ARCH=sm_61).
  • Running: Executables are placed in the bin/ directory. Usage examples: bin/gemm_bench <inference|train> <int8|float|half> and bin/nccl_single_all_reduce <num_gpus>; a sketch of what the all-reduce benchmark exercises follows this list.
  • Documentation: Detailed results are in the results/ folder, with library specifics in Excel sheets.
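As referenced in the Running bullet above, the following sketch shows the pattern bin/nccl_single_all_reduce exercises: one process driving an all-reduce across several GPUs through NCCL. The device count and buffer size here are assumptions for illustration, not DeepBench's published configurations.

    // Single-process multi-GPU all-reduce sketch (C++/CUDA, NCCL).
    // Device count and element count are illustrative; error checking omitted.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        const int nDev = 2;            // assumed number of GPUs
        const size_t count = 1 << 20;  // elements per buffer
        int devs[nDev] = {0, 1};

        ncclComm_t comms[nDev];
        float* sendbuff[nDev];
        float* recvbuff[nDev];
        cudaStream_t streams[nDev];

        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(devs[i]);
            cudaMalloc(&sendbuff[i], count * sizeof(float));
            cudaMalloc(&recvbuff[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }
        ncclCommInitAll(comms, nDev, devs);

        // All per-device calls must sit inside one NCCL group when a
        // single thread manages several GPUs.
        ncclGroupStart();
        for (int i = 0; i < nDev; ++i)
            ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat,
                          ncclSum, comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(devs[i]);
            cudaStreamSynchronize(streams[i]);
        }
        for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
        printf("all-reduce of %zu floats across %d GPUs complete\n", count, nDev);
        return 0;
    }

A benchmark would wrap the grouped ncclAllReduce calls in the same event-based timing loop as the GEMM sketch above and sweep the buffer size.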

Highlighted Details

  • Benchmarks training with FP16 inputs/FP32 math and inference with INT8 multiply/FP32 accumulate (see the mixed-precision sketch after this list).
  • Includes benchmarks for server-grade GPUs (NVIDIA, Intel Xeon Phi) and mobile devices (iPhone, Raspberry Pi).
  • Tests various All-Reduce implementations (NCCL, OSU, Baidu Allreduce, Intel MLSL) across different network topologies.
  • Supports sparse matrix operations, aiming to incentivize better kernel performance at 90-95% sparsity.
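A hedged sketch of the FP16-input/FP32-accumulate training scheme mentioned in the first bullet, expressed through cuBLAS's cublasGemmEx. The shape is illustrative, and on cuBLAS >= 11 the compute-type argument is spelled CUBLAS_COMPUTE_32F rather than CUDA_R_32F.

    // Mixed-precision GEMM sketch: FP16 inputs, FP32 output and accumulation.
    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    int main() {
        const int M = 1024, N = 512, K = 2048;  // illustrative shape
        __half *A, *B;
        float *C;
        cudaMalloc(&A, sizeof(__half) * M * K);
        cudaMalloc(&B, sizeof(__half) * K * N);
        cudaMalloc(&C, sizeof(float) * M * N);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;  // given in the compute type (FP32)

        // FP16 inputs (CUDA_R_16F), FP32 output and math (CUDA_R_32F).
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha,
                     A, CUDA_R_16F, M,
                     B, CUDA_R_16F, K,
                     &beta,
                     C, CUDA_R_32F, M,
                     CUDA_R_32F, CUBLAS_GEMM_DEFAULT);
        cudaDeviceSynchronize();
        printf("mixed-precision GEMM issued\n");

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

The INT8-multiply/FP32-accumulate inference path follows the same pattern with CUDA_R_8I input types, subject to the alignment restrictions that motivate the padding caveat noted under Limitations below.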

Maintenance & Community

  • The project is hosted on GitHub by Baidu Research.
  • Contributions are welcomed from researchers and hardware vendors for new operations, workloads, and hardware platforms.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Does not measure end-to-end model training or inference latency, focusing only on isolated operations.
  • RNN kernel support is limited on ARM devices due to library constraints.
  • INT8 convolution support in the ARM Compute Library is noted as forthcoming.
  • Some kernels may require input padding to meet precision or architecture requirements.
Health Check

  • Last Commit: 4 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
