Efficiently train foundation models with PyTorch
This repository provides an example of efficiently pre-training foundation models, specifically Llama2, using native PyTorch features such as Fully Sharded Data Parallel (FSDP) and the scaled dot-product attention (SDPA) API for Flash Attention v2. It targets researchers and engineers aiming to leverage PyTorch's advanced capabilities for large-scale model training, offering performance benchmarks and practical implementation details.
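The SDPA path referred to here is PyTorch's scaled_dot_product_attention operator, which can dispatch to a Flash Attention v2 kernel on supported hardware. Below is a minimal sketch, not taken from this repository, assuming a recent PyTorch release that ships the torch.nn.attention module and a CUDA device:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # recent PyTorch releases

# Illustrative shapes: (batch, heads, seq_len, head_dim). The flash kernel
# expects fp16/bf16 inputs on a CUDA device.
q = torch.randn(2, 32, 2048, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the Flash Attention backend for this call; outside the
# context manager PyTorch chooses a backend automatically.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```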
How It Works
The project leverages PyTorch's FSDP for distributed training and integrates an SDPA implementation of Flash Attention v2 for optimized attention computation. This approach aims to maximize hardware utilization and training throughput by combining these native PyTorch features with torch.compile, selective activation checkpointing, and overlapping computation with communication. The goal is to showcase efficient training strategies within the PyTorch ecosystem, rather than providing a full end-to-end framework.
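To make the combination concrete, here is a minimal sketch, not the repository's training code, of wrapping a transformer model with FSDP, applying selective activation checkpointing, and compiling it. The block_cls argument and the every-other-layer checkpointing policy are illustrative assumptions:

```python
import functools

import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def setup_model(model: torch.nn.Module, block_cls: type) -> torch.nn.Module:
    # Shard parameters, gradients, and optimizer state at transformer-block
    # granularity; bf16 mixed precision keeps compute and communication cheap.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={block_cls}
    )
    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16
        ),
        device_id=torch.cuda.current_device(),
    )

    # Selective activation checkpointing: recompute activations for only
    # every other block, trading memory savings against recomputation cost.
    blocks = [m for m in model.modules() if isinstance(m, block_cls)]
    selected = set(blocks[::2])
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m in selected,
    )

    # torch.compile then fuses kernels on top of the FSDP-wrapped model.
    return torch.compile(model)
```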
Quick Start & Requirements
- Install dependencies: pip install -r requirements.txt
- Requires the ibm-fms package (IBM's Foundation Model Stack).
- Training is launched via a Slurm script (sbatch ./scripts/train.slurm), but torchrun commands are available.
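For the torchrun path, a training entry point typically follows the standard pattern below; this is a generic sketch, with the script name and launch flags illustrative rather than taken from this repository's scripts:

```python
# Example launch (illustrative): torchrun --nnodes=1 --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE; init_process_group
    # reads them from the environment by default.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # ... build the model, wrap it with FSDP, and run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```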
Highlighted Details
Maintenance & Community
This repository is a companion to the Foundation Model Stack and represents IBM's work with the PyTorch community. Specific community links or active maintainer information are not detailed in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README text. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The repository focuses on the pre-training phase and does not include data preparation or post-training alignment/tuning. It uses an internally curated dataset and omits details such as sampling ratios. Larger models trained at smaller batch sizes may show lower hardware utilization (Model FLOPs Utilization, MFU).