ML training examples and reference architectures on AWS
This repository collects reference architectures, best practices, and test cases for distributed machine learning model training on AWS. It targets ML engineers and researchers who want to optimize large-scale training on Amazon SageMaker HyperPod, AWS ParallelCluster, AWS Batch, and Amazon EKS, and it provides practical examples and validation tools.
How It Works
The project groups its distributed training material into three categories: reference architectures (CloudFormation templates for S3, VPC, and compute clusters), custom AMIs and containers, and detailed test cases. Test cases are organized by framework (PyTorch, JAX) and by parallelization strategy or training stack (DDP, FSDP, Megatron-LM, NeMo, Trainium). Each includes scripts and configurations, along with validation utilities for performance monitoring and troubleshooting.
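As a rough illustration of the idea behind DDP, the simplest strategy listed above, the sketch below simulates data-parallel training in plain Python: every worker keeps an identical replica of the model, computes a gradient on its own data shard, and the gradients are averaged before each update. This is not code from the repository. The helper names (`local_gradient`, `all_reduce_mean`, `train`), the one-parameter model, and the toy data are all assumptions made for illustration, and a real test case would use PyTorch's distributed APIs rather than this loop.

```python
# Conceptual sketch of data-parallel training (the idea behind DDP).
# Standard library only; all names here are hypothetical, not from the repository.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(shards, w=0.0, lr=0.01, steps=200):
    # Each worker starts from an identical replica of the weight.
    replicas = [w] * len(shards)
    for _ in range(steps):
        # Every worker computes a gradient on its own shard only.
        grads = [local_gradient(rw, s) for rw, s in zip(replicas, shards)]
        # Gradients are averaged across workers, so all replicas stay in sync.
        synced = all_reduce_mean(grads)
        replicas = [rw - lr * g for rw, g in zip(replicas, synced)]
    return replicas

data = [(x, 3.0 * x) for x in range(1, 9)]  # toy data generated from y = 3x
shards = [data[0:4], data[4:8]]             # two equal-sized shards, one per worker
replicas = train(shards)
```

Because the shards are equal in size, the averaged gradient matches the gradient over the full dataset, so the replicas follow the same trajectory a single worker would. FSDP builds on this scheme by also sharding parameters, gradients, and optimizer state across workers, which this sketch does not attempt to model.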
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The repository is part of AWS Samples, which indicates official backing, and it credits its individual contributors.
Licensing & Compatibility
The repository is licensed under the Apache-2.0 License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The repository is undergoing a major refactoring that focuses in particular on the test cases. Users who prefer the previous structure should refer to v1.1.0. The architectures are designed to work with the specific S3 buckets and VPCs created by the provided templates.