This repository provides reference implementations for MLPerf™ training benchmarks, targeting ML engineers and researchers seeking to understand or implement standardized machine learning performance tests. It offers a starting point for benchmark implementations, enabling users to evaluate model training performance across various frameworks and hardware.
How It Works
Each benchmark ships with a model implementation in at least one framework, a Dockerfile for containerized execution, dataset download scripts, and timing scripts. This standardizes the benchmarking process and enables reproducible performance comparisons across different hardware and software stacks.
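As a rough illustration of what a single benchmark directory typically contains (the names below are hypothetical; the actual layout and script names vary by benchmark and framework, and each benchmark's README is authoritative):

```sh
# Illustrative layout of one benchmark's reference implementation -- the real
# directory names, scripts, and contents differ per benchmark and framework.
#
# <benchmark>/<framework>/
#   Dockerfile            # container image for reproducible execution
#   download_dataset.sh   # fetches the benchmark's dataset
#   verify_dataset.sh     # checks dataset integrity (where provided)
#   run_and_time.sh       # trains to the target quality and reports timing
#   ...                   # model and training code
```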
Quick Start & Requirements
- Install/Run: Follow the instructions in each benchmark's README. The general flow is to set up Docker and dependencies (e.g., `install_cuda_docker.sh`), download the dataset (`./download_dataset.sh`), and build and run the Docker image; see the sketch after this list.
- Prerequisites: Docker, CUDA (implied by `install_cuda_docker.sh`), framework-specific dependencies (PyTorch, TensorFlow, NeMo, TorchRec, GLT), and large datasets (e.g., LAION-400M-filtered, C4, OpenImages).
- Resources: Benchmarks are compute-intensive and can take considerable time to reach target quality, even on reference hardware.
- Docs: MLPerf Training Benchmark paper
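
A minimal sketch of the general flow, assuming a benchmark that follows the shared script conventions: the repository URL points at the MLCommons training repo, while the benchmark path, image name, and Docker arguments are placeholders, not taken from any specific benchmark's README.

```sh
# Hedged sketch of the typical steps; placeholders in <angle brackets> and the
# Docker invocation are illustrative and differ between benchmarks.
git clone https://github.com/mlcommons/training.git
cd training/<benchmark>/<framework>

# 1. Set up Docker and CUDA dependencies (shared helper script)
source install_cuda_docker.sh

# 2. Download the dataset and, where a script is provided, verify it
./download_dataset.sh
./verify_dataset.sh

# 3. Build the container and run the benchmark; training runs until the target
#    quality is reached, then prints timing results
docker build -t mlperf/<benchmark>:reference .
docker run --gpus all -v /path/to/data:/data mlperf/<benchmark>:reference \
  ./run_and_time.sh
```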
Highlighted Details
- Reference implementations for MLPerf Training v5.0, v4.1, and v4.0 benchmarks.
- Covers diverse models: RetinaNet, Stable Diffusion, BERT, Llama, DLRM, RGAT, GPT-3, and 3D U-Net.
- Supports multiple frameworks: PyTorch, TensorFlow, NeMo, TorchRec, GLT, Paxml, Megatron-LM.
- Includes scripts for dataset download and verification.
Maintenance & Community
- The project describes its reference implementations as "alpha" or "beta" quality and encourages community contributions via issues and pull requests.
- No specific community links (Discord/Slack) or roadmap are provided in the README.
Licensing & Compatibility
- No license is stated in the README snippet summarized here. MLPerf benchmarks are developed by the MLCommons consortium and are intended for broad adoption; the licenses of the underlying frameworks (PyTorch, TensorFlow, etc.) apply to the reference implementations.
Limitations & Caveats
- Reference implementations are not fully optimized and are not intended for "real" performance measurements of software frameworks or hardware.
- Benchmarks can be slow and resource-intensive.
- The project is in an early stage ("alpha" or "beta") and may have quality issues or require significant improvements.