Efficiently train foundation models with PyTorch
This repository provides an example of efficiently pre-training foundation models, specifically Llama2, using native PyTorch features such as Fully Sharded Data Parallel (FSDP) and the scaled dot-product attention (SDPA) API for Flash Attention v2. It targets researchers and engineers aiming to leverage PyTorch's advanced capabilities for large-scale model training, offering performance benchmarks and practical implementation details.
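The SDPA path referred to here is PyTorch's scaled_dot_product_attention operator, which can dispatch to a Flash Attention v2 kernel on supported hardware. Below is a minimal sketch, not taken from this repository, assuming a recent PyTorch release that ships the torch.nn.attention module and a CUDA device:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # recent PyTorch releases

# Illustrative shapes: (batch, heads, seq_len, head_dim). The flash kernel
# expects fp16/bf16 inputs on a CUDA device.
q = torch.randn(2, 32, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the Flash Attention backend for this call; outside the
# context manager PyTorch chooses a backend automatically.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```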
How It Works
The project leverages PyTorch's FSDP for distributed training and integrates an SDPA implementation of Flash Attention v2 for optimized attention computation. This approach aims to maximize hardware utilization and training throughput by combining these native PyTorch features with torch.compile, selective activation checkpointing, and overlapping computation with communication. The goal is to showcase efficient training strategies within the PyTorch ecosystem, rather than providing a full end-to-end framework.
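To make the combination concrete, here is a minimal sketch, not the repository's training code, of wrapping a transformer model with FSDP, applying selective activation checkpointing, and compiling it. The block_cls argument and the every-other-layer checkpointing policy are illustrative assumptions:

```python
import functools

import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def setup_model(model: torch.nn.Module, block_cls: type) -> torch.nn.Module:
    # Shard parameters, gradients, and optimizer state at transformer-block
    # granularity; bf16 mixed precision keeps compute and communication cheap.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={block_cls}
    )
    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16
        ),
        device_id=torch.cuda.current_device(),
    )

    # Selective activation checkpointing: recompute activations for only
    # every other block, trading memory savings against recomputation cost.
    blocks = [m for m in model.modules() if isinstance(m, block_cls)]
    selected = set(blocks[::2])
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m in selected,
    )

    # torch.compile then fuses kernels on top of the FSDP-wrapped model.
    return torch.compile(model)
```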
Quick Start & Requirements
- Install dependencies: pip install -r requirements.txt
- Requires the ibm-fms package (IBM's Foundation Model Stack).
- Training is launched via a Slurm script (sbatch ./scripts/train.slurm), but torchrun commands are available.
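For the torchrun path, a training entry point typically follows the standard pattern below; this is a generic sketch, with the script name and launch flags illustrative rather than taken from this repository's scripts:

```python
# Example launch (illustrative): torchrun --nnodes=1 --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE; init_process_group
    # reads them from the environment by default.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # ... build the model, wrap it with FSDP, and run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```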
Highlighted Details
Maintenance & Community
This repository is a companion to the Foundation Model Stack and represents IBM's work with the PyTorch community. Specific community links or active maintainer information are not detailed in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README text. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The repository focuses on the pre-training phase and does not include data preparation or post-training alignment/tuning. It uses an internally curated dataset and omits details such as sampling ratios. Larger models trained at smaller batch sizes may show lower hardware utilization (Model FLOPs Utilization, MFU).