Transformer training benchmark for GPUs
This repository benchmarks the real-world TeraFLOPS achieved when training Transformer models on various NVIDIA GPUs, including multi-GPU and multi-node setups. It targets researchers and engineers who need to estimate training times for large-scale models, providing practical performance data and tools for running your own benchmarks.
How It Works
The project measures TeraFLOPS by executing micro-benchmarks and full Transformer layer forward/backward passes for models like BERT, GPT-2, and T5. It compares achieved performance against theoretical hardware limits, offering insights into how factors like precision (TF32/FP16), batch size, and specific GPU architectures impact actual throughput.
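Below is a minimal sketch, not the repository's actual benchmark script, of how achieved TFLOPS can be estimated in PyTorch: time forward/backward passes of a Transformer layer and divide an analytic FLOP count by the measured wall-clock time. The layer shape, step counts, and the 6 FLOPs-per-parameter-per-token approximation are illustrative assumptions.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
batch_size, seq_len, d_model, n_heads = 64, 512, 1024, 16  # hypothetical config

# Single Transformer encoder layer in FP16 as a stand-in for a full model block.
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
    batch_first=True, dtype=torch.float16, device=device)
x = torch.randn(batch_size, seq_len, d_model, dtype=torch.float16, device=device)

n_params = sum(p.numel() for p in layer.parameters())
tokens = batch_size * seq_len
# Rough, widely used approximation: ~6 FLOPs per parameter per token for
# forward + backward (ignores attention-score FLOPs).
flops_per_step = 6 * n_params * tokens

# Warm-up to exclude kernel selection / allocator overhead from timing.
for _ in range(5):
    layer.zero_grad(set_to_none=True)
    layer(x).sum().backward()

torch.cuda.synchronize()
start = time.time()
steps = 20
for _ in range(steps):
    layer.zero_grad(set_to_none=True)
    layer(x).sum().backward()
torch.cuda.synchronize()
elapsed = (time.time() - start) / steps

achieved_tflops = flops_per_step / elapsed / 1e12
peak_tflops = 312.0  # e.g., A100 FP16 Tensor Core peak; replace with your GPU's spec
print(f"achieved ~{achieved_tflops:.1f} TFLOPS "
      f"({achieved_tflops / peak_tflops:.1%} of peak)")
```

Comparing the achieved figure against the GPU's theoretical peak in this way gives the utilization numbers the benchmark tables report.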
Quick Start & Requirements
The benchmarks are intended to run inside the NVIDIA NGC PyTorch container (nvcr.io/nvidia/pytorch:22.07-py3).
Highlighted Details
Maintenance & Community
No specific community channels or contributor details are listed in the README.
Licensing & Compatibility
The repository's license is not specified in the README.
Limitations & Caveats
Performance figures are specific to the hardware and configurations tested by the authors and may vary significantly depending on the user's environment, CUDA version, and specific model implementations.