training by mlcommons

Reference implementations for MLPerf training benchmarks

Created 7 years ago · 1,696 stars


Project Summary

This repository provides reference implementations for MLPerf™ training benchmarks, targeting ML engineers and researchers seeking to understand or implement standardized machine learning performance tests. It offers a starting point for benchmark implementations, enabling users to evaluate model training performance across various frameworks and hardware.

How It Works

The project offers code for MLPerf training benchmarks, including model implementations in at least one framework, Dockerfiles for containerized execution, dataset download scripts, and timing scripts. This approach standardizes the benchmarking process, allowing for reproducible performance comparisons across different hardware and software stacks.

Quick Start & Requirements

  • Install/Run: Follow the instructions in each benchmark's README. The general flow is to set up Docker and dependencies (e.g., install_cuda_docker.sh), download the dataset (./download_dataset.sh), and build and run the Docker image; see the sketch after this list.
  • Prerequisites: Docker, CUDA (implied by install_cuda_docker.sh), specific framework dependencies (PyTorch, TensorFlow, NeMo, TorchRec, GLT), and large datasets (e.g., LAION-400M-filtered, C4, OpenImages).
  • Resources: Benchmarks are compute-intensive and can take substantial time to run, even on the reference hardware.
  • Docs: MLPerf Training Benchmark paper

Highlighted Details

  • Reference implementations for MLPerf Training v5.0, v4.1, and v4.0 benchmarks.
  • Covers diverse models: RetinaNet, Stable Diffusion, BERT, Llama, DLRM, R-GAT, GPT-3, and 3D U-Net.
  • Supports multiple frameworks: PyTorch, TensorFlow, NeMo, TorchRec, GLT, Paxml, Megatron-LM.
  • Includes scripts for dataset download and verification.

Maintenance & Community

  • The project describes its implementations as "alpha" or "beta" quality and encourages community contributions via issues and pull requests.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README snippet does not state a license for the repository itself. MLPerf is run by the MLCommons consortium, and its benchmarks are intended for broad adoption; the licenses of the underlying frameworks (PyTorch, TensorFlow, etc.) apply to the reference implementations.

Limitations & Caveats

  • Reference implementations are not fully optimized and are not intended for "real" performance measurements of software frameworks or hardware.
  • Benchmarks can be slow and resource-intensive.
  • The project is in an early stage ("alpha" or "beta") and may have quality issues or require significant improvements.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 8
  • Issues (30d): 7
  • Star history: 34 stars in the last 90 days
