DeepBench by baidu-research

Deep learning benchmark for hardware performance on core operations

created 8 years ago
1,094 stars

Top 35.4% on sourcepulse

Project Summary

DeepBench is a project for benchmarking fundamental deep learning operations (GEMM, convolutions, recurrent layers, and all-reduce) across various hardware platforms. It targets hardware vendors and researchers who need to understand performance bottlenecks in deep learning training and inference, and it benchmarks low-level operations in isolation rather than full-model performance.

How It Works

DeepBench defines specific operation sizes and precision requirements for both training and inference. It uses vendor-supplied libraries (e.g., cuDNN, MKL) so that results reflect the performance a typical user would see. The benchmark measures execution time and achieved FLOPS for operations such as dense matrix multiplies, convolutions (NCHW format), recurrent cells (vanilla RNN, LSTM, GRU), and All-Reduce communication patterns.
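
To make the reported metric concrete, here is a minimal C++ sketch, not DeepBench's actual code, that times a dense matrix multiply and converts the elapsed time into achieved GFLOPS, the same time-and-throughput pair the suite records for each kernel size. A real benchmark run calls a vendor library such as cuBLAS or MKL; the naive loop and the 256x256x256 shape below are purely illustrative.

    // Illustrative only: time a GEMM and report achieved GFLOPS.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int M = 256, N = 256, K = 256;  // illustrative GEMM shape
        std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < M; ++i)
            for (int k = 0; k < K; ++k)       // k-outer order reuses A[i*K + k]
                for (int j = 0; j < N; ++j)
                    C[i * N + j] += A[i * K + k] * B[k * N + j];
        auto end = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(end - start).count();
        double flops = 2.0 * M * N * K;       // one multiply + one add per MAC
        std::printf("GEMM %dx%dx%d: %.3f ms, %.2f GFLOPS\n",
                    M, N, K, secs * 1e3, flops / secs / 1e9);
        return 0;
    }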

Quick Start & Requirements

  • Installation: Clone the repository and use make commands with specified paths for CUDA, cuDNN, MPI, and NCCL.
  • Prerequisites: CUDA Toolkit (tested with 7.5.18), cuDNN (tested with 5.0), OpenMPI (tested with 1.10.2), and NCCL (tested against a pinned commit). For specific hardware: ROCm, MIOpen, and rocBLAS for AMD; ARM Compute Library and Eigen for ARM.
  • Compilation: Requires specifying library paths and, for NVIDIA GPUs, the target architecture (e.g., ARCH=sm_61).
  • Running: Executables are placed in the bin/ directory. Usage follows bin/gemm_bench <inference|train> <int8|float|half> (e.g., bin/gemm_bench train float) and bin/nccl_single_all_reduce <num_gpus>.
  • Documentation: Detailed results are in the results/ folder, with library specifics in Excel sheets.

Highlighted Details

  • Benchmarks training with FP16 inputs/FP32 math and inference with INT8 multiplies and 32-bit accumulates (a sketch of the accumulate-width contract follows this list).
  • Includes benchmarks for server-class hardware (NVIDIA GPUs, Intel Xeon Phi) and mobile devices (iPhone, Raspberry Pi).
  • Tests various All-Reduce implementations (NCCL, OSU, Baidu Allreduce, Intel MLSL) across different network topologies.
  • Supports sparse matrix operations, aiming to incentivize better performance for 90-95% sparsity.
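
The accumulate width in the precision scheme above matters because long dot products overflow narrow accumulators. The following C++ sketch is an illustration rather than DeepBench code: it shows the arithmetic contract of an 8-bit multiply with a 32-bit accumulate, which vendor kernels implement in hardware.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Dot product with int8 inputs and a 32-bit accumulator (illustrative).
    int32_t dot_i8_acc32(const std::vector<int8_t>& a,
                         const std::vector<int8_t>& b) {
        int32_t acc = 0;
        for (size_t i = 0; i < a.size(); ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        return acc;
    }

    int main() {
        // Worst-case int8 magnitudes over a 4096-element reduction:
        // 4096 * 127 * 127 = 66,064,384 fits in 32 bits but would
        // overflow a 16-bit accumulator (max 32,767).
        std::vector<int8_t> a(4096, 127), b(4096, 127);
        std::printf("acc = %d\n", dot_i8_acc32(a, b));
        return 0;
    }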

Maintenance & Community

  • The project is hosted on GitHub by Baidu Research.
  • Contributions are welcomed from researchers and hardware vendors for new operations, workloads, and hardware platforms.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Does not measure end-to-end model training or inference latency, focusing only on isolated operations.
  • RNN kernel support is limited on ARM devices due to library constraints.
  • INT8 convolution support on ARM Compute Library is noted as forthcoming.
  • Some kernels may require input padding to meet precision or architecture requirements, as sketched below.
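
The padding caveat typically means rounding a problem dimension up to a hardware-required multiple (int8 GEMM paths, for example, commonly require dimensions divisible by 4). The helper below is a hypothetical illustration; the function name, the alignment multiple, and the example size are assumptions, not DeepBench's code.

    #include <cstdio>

    // Hypothetical helper: round n up to the nearest multiple of align (align > 0).
    int pad_to_multiple(int n, int align) {
        return ((n + align - 1) / align) * align;
    }

    int main() {
        const int k = 2058;                      // illustrative layer dimension
        std::printf("k = %d -> padded k = %d\n", // prints: k = 2058 -> padded k = 2060
                    k, pad_to_multiple(k, 4));
        return 0;
    }
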
Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

Explore Similar Projects

  • nunchaku by nunchaku-tech: High-performance 4-bit diffusion model inference engine. 3k stars; created 8 months ago; updated 14 hours ago. Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • ggml by ggml-org: Tensor library for machine learning. 13k stars; created 2 years ago; updated 3 days ago. Starred by Bojan Tunguz (AI scientist; formerly at NVIDIA), Mckay Wrigley (founder of Takeoff AI), and 8 more.