Transformer training benchmark for GPUs
This repository benchmarks the real-world TeraFLOPS achieved when training Transformer models on various NVIDIA GPUs, including multi-GPU and multi-node setups. It targets researchers and engineers who need to estimate training times for large-scale models, providing practical performance data and tools for running your own benchmarks.
How It Works
The project measures TeraFLOPS by executing micro-benchmarks and full Transformer layer forward/backward passes for models like BERT, GPT-2, and T5. It compares achieved performance against theoretical hardware limits, offering insights into how factors like precision (TF32/FP16), batch size, and specific GPU architectures impact actual throughput.
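Below is a minimal sketch, not the repository's actual benchmark script, of how achieved TFLOPS can be estimated in PyTorch: time forward/backward passes of a Transformer layer and divide an analytic FLOP count by the measured wall-clock time. The layer shape, step counts, and the 6 FLOPs-per-parameter-per-token approximation are illustrative assumptions.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
batch_size, seq_len, d_model, n_heads = 64, 512, 1024, 16  # hypothetical config

# Single Transformer encoder layer in FP16 as a stand-in for a full model block.
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
    batch_first=True, dtype=torch.float16, device=device)
x = torch.randn(batch_size, seq_len, d_model, dtype=torch.float16, device=device)

n_params = sum(p.numel() for p in layer.parameters())
tokens = batch_size * seq_len
# Rough, widely used approximation: ~6 FLOPs per parameter per token for
# forward + backward (ignores attention-score FLOPs).
flops_per_step = 6 * n_params * tokens

# Warm-up to exclude kernel selection / allocator overhead from timing.
for _ in range(5):
    layer.zero_grad(set_to_none=True)
    layer(x).sum().backward()

torch.cuda.synchronize()
start = time.time()
steps = 20
for _ in range(steps):
    layer.zero_grad(set_to_none=True)
    layer(x).sum().backward()
torch.cuda.synchronize()
elapsed = (time.time() - start) / steps

achieved_tflops = flops_per_step / elapsed / 1e12
peak_tflops = 312.0  # e.g., A100 FP16 Tensor Core peak; replace with your GPU's spec
print(f"achieved ~{achieved_tflops:.1f} TFLOPS "
      f"({achieved_tflops / peak_tflops:.1%} of peak)")
```

Comparing the achieved figure against the GPU's theoretical peak in this way gives the utilization numbers the benchmark tables report.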
Quick Start & Requirements
The benchmarks are intended to run inside the NVIDIA NGC PyTorch container (nvcr.io/nvidia/pytorch:22.07-py3).
Highlighted Details
Maintenance & Community
No specific community channels or contributor details are listed in the README.
Licensing & Compatibility
The repository's license is not specified in the README.
Limitations & Caveats
Performance figures are specific to the hardware and configurations tested by the authors and may vary significantly depending on the user's environment, CUDA version, and specific model implementations.