DeepBench by baidu-research

Deep learning benchmark for hardware performance on core operations

created 8 years ago
1,094 stars

Top 35.4% on sourcepulse

Project Summary

DeepBench is a project for benchmarking fundamental deep learning operations (GEMM, convolutions, recurrent layers, and all-reduce) across various hardware platforms. It targets hardware vendors and researchers who need to understand performance bottlenecks in deep learning training and inference, and it benchmarks low-level operations in isolation rather than full-model performance.

How It Works

DeepBench defines specific operation sizes and precision requirements for both training and inference. It uses vendor-supplied libraries (e.g., cuDNN, MKL) so that results reflect the performance a typical user would see. The benchmark measures execution time and achieved FLOPS for operations such as dense matrix multiplies, convolutions (NCHW format), recurrent cells (vanilla RNN, LSTM, GRU), and All-Reduce communication patterns.
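
To make the reported metric concrete, here is a minimal C++ sketch, not DeepBench's actual code, that times a dense matrix multiply and converts the elapsed time into achieved GFLOPS, the same time-and-throughput pair the suite records for each kernel size. A real benchmark run calls a vendor library such as cuBLAS or MKL; the naive loop and the 256x256x256 shape below are purely illustrative.

    // Illustrative only: time a GEMM and report achieved GFLOPS.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int M = 256, N = 256, K = 256;  // illustrative GEMM shape
        std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < M; ++i)
            for (int k = 0; k < K; ++k)       // k-outer order reuses A[i*K + k]
                for (int j = 0; j < N; ++j)
                    C[i * N + j] += A[i * K + k] * B[k * N + j];
        auto end = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(end - start).count();
        double flops = 2.0 * M * N * K;       // one multiply + one add per MAC
        std::printf("GEMM %dx%dx%d: %.3f ms, %.2f GFLOPS\n",
                    M, N, K, secs * 1e3, flops / secs / 1e9);
        return 0;
    }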

Quick Start & Requirements

  • Installation: Clone the repository and use make commands with specified paths for CUDA, cuDNN, MPI, and NCCL.
  • Prerequisites: CUDA Toolkit (tested with 7.5.18), cuDNN (tested with 5.0), OpenMPI (tested with 1.10.2), and NCCL (tested against a pinned commit). For specific hardware: ROCm, MIOpen, and rocBLAS for AMD; ARM Compute Library and Eigen for ARM.
  • Compilation: Requires specifying library paths and, for NVIDIA GPUs, the target architecture (e.g., ARCH=sm_61).
  • Running: Executables are placed in the bin/ directory. Usage follows bin/gemm_bench <inference|train> <int8|float|half> (e.g., bin/gemm_bench train float) and bin/nccl_single_all_reduce <num_gpus>.
  • Documentation: Detailed results are in the results/ folder, with library specifics in Excel sheets.

Highlighted Details

  • Benchmarks training with FP16 inputs/FP32 math and inference with INT8 multiplies and 32-bit accumulates (a sketch of the accumulate-width contract follows this list).
  • Includes benchmarks for server-class hardware (NVIDIA GPUs, Intel Xeon Phi) and mobile devices (iPhone, Raspberry Pi).
  • Tests various All-Reduce implementations (NCCL, OSU, Baidu Allreduce, Intel MLSL) across different network topologies.
  • Supports sparse matrix operations, aiming to incentivize better performance for 90-95% sparsity.
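
The accumulate width in the precision scheme above matters because long dot products overflow narrow accumulators. The following C++ sketch is an illustration rather than DeepBench code: it shows the arithmetic contract of an 8-bit multiply with a 32-bit accumulate, which vendor kernels implement in hardware.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Dot product with int8 inputs and a 32-bit accumulator (illustrative).
    int32_t dot_i8_acc32(const std::vector<int8_t>& a,
                         const std::vector<int8_t>& b) {
        int32_t acc = 0;
        for (size_t i = 0; i < a.size(); ++i)
            acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
        return acc;
    }

    int main() {
        // Worst-case int8 magnitudes over a 4096-element reduction:
        // 4096 * 127 * 127 = 66,064,384 fits in 32 bits but would
        // overflow a 16-bit accumulator (max 32,767).
        std::vector<int8_t> a(4096, 127), b(4096, 127);
        std::printf("acc = %d\n", dot_i8_acc32(a, b));
        return 0;
    }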

Maintenance & Community

  • The project is hosted on GitHub by Baidu Research.
  • Contributions are welcomed from researchers and hardware vendors for new operations, workloads, and hardware platforms.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing terms.

Limitations & Caveats

  • Does not measure end-to-end model training or inference latency, focusing only on isolated operations.
  • RNN kernel support is limited on ARM devices due to library constraints.
  • INT8 convolution support on ARM Compute Library is noted as forthcoming.
  • Some kernels may require input padding to meet precision or architecture requirements, as sketched below.
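
The padding caveat typically means rounding a problem dimension up to a hardware-required multiple (int8 GEMM paths, for example, commonly require dimensions divisible by 4). The helper below is a hypothetical illustration; the function name, the alignment multiple, and the example size are assumptions, not DeepBench's code.

    #include <cstdio>

    // Hypothetical helper: round n up to the nearest multiple of align (align > 0).
    int pad_to_multiple(int n, int align) {
        return ((n + align - 1) / align) * align;
    }

    int main() {
        const int k = 2058;                      // illustrative layer dimension
        std::printf("k = %d -> padded k = %d\n", // prints: k = 2058 -> padded k = 2060
                    k, pad_to_multiple(k, 4));
        return 0;
    }
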
Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

Explore Similar Projects

  • nunchaku by nunchaku-tech: High-performance 4-bit diffusion model inference engine. 3k stars; created 8 months ago; updated 14 hours ago. Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • ggml by ggml-org: Tensor library for machine learning. 13k stars; created 2 years ago; updated 3 days ago. Starred by Bojan Tunguz (AI scientist; formerly at NVIDIA), Mckay Wrigley (founder of Takeoff AI), and 8 more.