awsome-distributed-training  by aws-samples

ML training examples and reference architectures on AWS

Created 1 year ago
343 stars

Top 80.6% on SourcePulse

Project Summary

This repository collects reference architectures, best practices, and test cases for distributed machine learning model training on AWS. It targets ML engineers and researchers who want to optimize large-scale training on Amazon SageMaker HyperPod, AWS ParallelCluster, AWS Batch, and Amazon EKS, and it offers practical examples and validation tools.

How It Works

The repository is organized into three parts: reference architectures (CloudFormation templates for S3, VPC, and compute clusters), custom AMIs and containers, and detailed test cases. Test cases are grouped by framework (PyTorch, JAX) and by parallelization strategy or library (DDP, FSDP, Megatron-LM, NeMo), with further cases for AWS Trainium. They include scripts, configurations, and validation utilities for performance monitoring and troubleshooting.
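As background for the DDP entry above, the following toy sketch shows the idea behind synchronous data parallelism: each worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update. It is plain Python with no GPUs or PyTorch, it is not taken from the repository, and every function name in it is hypothetical.

```python
# Toy illustration of data-parallel (DDP-style) gradient averaging.
# Assumption: a one-parameter model y = w * x trained with mean squared error.

def grad_mse(w, batch):
    """Gradient of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def shard(data, world_size):
    """Split data into equal contiguous shards, one per worker."""
    if len(data) % world_size != 0:
        raise ValueError("this sketch assumes equal-size shards")
    per_rank = len(data) // world_size
    return [data[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)

def ddp_step(w, data, world_size, lr):
    """One synchronous step: local gradients, average, identical update."""
    local_grads = [grad_mse(w, s) for s in shard(data, world_size)]
    return w - lr * all_reduce_mean(local_grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w_parallel = ddp_step(0.0, data, world_size=2, lr=0.01)
w_single = 0.0 - 0.01 * grad_mse(0.0, data)  # same step on one worker
print(w_parallel, w_single)
```

With equal-size shards, the averaged gradient equals the full-batch gradient, so the two-worker step and the single-worker step land on the same weight. Real DDP applies the same principle to every parameter tensor and overlaps the all-reduce with the backward pass.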

Quick Start & Requirements

  • Installation: Primarily involves deploying CloudFormation templates for infrastructure setup.
  • Prerequisites: AWS account, CloudFormation, Docker, Python, and potentially specific ML frameworks and libraries depending on the test case. EFA (Elastic Fabric Adapter) is recommended for high-performance networking.
  • Resources: Requires AWS infrastructure provisioning (VPC, S3, compute instances).
  • Documentation: Workshops are available for SageMaker HyperPod and AWS ParallelCluster.
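To make the installation step more concrete, here is a minimal sketch that assembles an `aws cloudformation deploy` command for a template file. The stack name, template filename, and parameter names are placeholders rather than values from the repository, and whether a given template needs `CAPABILITY_IAM` is an assumption to check against that template.

```python
import shlex

def build_deploy_command(stack_name, template_file, parameters=None,
                         region=None, capabilities=None):
    """Return the argument list for `aws cloudformation deploy`."""
    cmd = [
        "aws", "cloudformation", "deploy",
        "--stack-name", stack_name,
        "--template-file", template_file,
    ]
    if region:
        cmd += ["--region", region]
    if capabilities:
        # For example CAPABILITY_IAM when the template creates IAM resources.
        cmd += ["--capabilities", *capabilities]
    if parameters:
        # Key=Value pairs, sorted so the command is reproducible.
        cmd.append("--parameter-overrides")
        cmd += [f"{key}={value}" for key, value in sorted(parameters.items())]
    return cmd

# Hypothetical inputs, for illustration only.
cmd = build_deploy_command(
    stack_name="example-training-vpc",
    template_file="example-vpc.yaml",
    parameters={"ExampleParam": "value"},
    region="us-east-1",
    capabilities=["CAPABILITY_IAM"],
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the AWS CLI is installed and credentials are configured. Building the arguments as a list avoids shell quoting problems.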

Highlighted Details

  • Supports multiple AWS compute services: SageMaker HyperPod, ParallelCluster, Batch, EKS.
  • Extensive test cases cover various PyTorch distributed training strategies (FSDP, Megatron-LM, NeMo) and AWS Trainium.
  • Includes validation scripts and tools for performance monitoring (e.g., an EFA Prometheus exporter and NVIDIA Nsight profiling).
  • CI integration with pytest is provided for automated testing.
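As an illustration of the kind of check a validation script might run, the sketch below tests whether libfabric reports the `efa` provider by parsing the output of `fi_info -p efa`. This is an assumed approach, not the repository's own tooling, and the sample output is illustrative rather than captured from a real instance.

```python
import shutil
import subprocess

def efa_provider_listed(fi_info_output):
    """Return True if any line of fi_info output names the efa provider."""
    for line in fi_info_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "provider" and value.strip() == "efa":
            return True
    return False

def check_efa():
    """Run `fi_info -p efa`; return None when fi_info is not installed."""
    if shutil.which("fi_info") is None:
        return None
    result = subprocess.run(
        ["fi_info", "-p", "efa"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and efa_provider_listed(result.stdout)

# Illustrative output shape (assumption), used to exercise the parser offline.
SAMPLE_OUTPUT = """provider: efa
    fabric: EFA-example
    domain: example-rdm
    type: FI_EP_RDM
"""

print(efa_provider_listed(SAMPLE_OUTPUT))
```

Keeping the parser separate from the subprocess call means it can be exercised in a pytest test without EFA hardware.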

Maintenance & Community

The repository is published under the AWS Samples GitHub organization (aws-samples) and credits a number of individual contributors.

Licensing & Compatibility

The repository is licensed under the Apache-2.0 License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The repository is undergoing a major refactoring, with test cases being a particular focus. Users preferring the previous structure should refer to v1.1.0. Architectures are designed to work with specific S3 buckets and VPCs created via provided templates.

Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 34
  • Issues (30d): 11

Star History

12 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Sebastián Ramírez (Author of FastAPI, Typer, SQLModel, Asyncer), and 1 more.

training by mlcommons

0.2%
2k
Reference implementations for MLPerf training benchmarks
Created 7 years ago
Updated 1 week ago