YaFSDP by yandex

Sharded data parallelism framework for transformer-like neural networks

Created 1 year ago
975 stars

Top 37.9% on SourcePulse

View on GitHub
Project Summary

YaFSDP is a Sharded Data Parallelism framework designed for efficient training of transformer-like neural network architectures, particularly Large Language Models (LLMs). It targets researchers and engineers working with large-scale models who need to optimize training speed and memory usage, offering up to 20% faster pre-training and improved performance under high memory pressure compared to PyTorch's FSDP.

How It Works

YaFSDP is built to reduce communication and memory operation overhead. While specific internal mechanisms are not detailed in the README, its performance gains suggest optimizations in parameter sharding, gradient communication, and memory management strategies tailored for transformer architectures. This approach aims to maximize GPU utilization and minimize synchronization bottlenecks during distributed training.
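
This summary does not document YaFSDP's own API, so as a point of reference, below is a minimal sketch of the PyTorch FSDP baseline it is benchmarked against: wrapping a stand-in transformer block so that parameters, gradients, and optimizer state are sharded across ranks. The module, class, and variable names are illustrative placeholders, not part of YaFSDP.

```python
# Minimal sharded data-parallel sketch using PyTorch's built-in FSDP,
# the baseline YaFSDP is compared against. "Block" is a stand-in for a
# transformer layer; a real model would use attention + MLP sublayers.
import functools

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class Block(nn.Module):
    """Placeholder transformer block (norm + feed-forward with a residual)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))


def main():
    # One process per GPU, typically launched via torchrun.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(*[Block(1024) for _ in range(8)]).cuda()

    # Shard at the block level: each rank stores only a slice of every
    # Block's parameters, gradients, and optimizer state.
    wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                    transformer_layer_cls={Block})
    model = FSDP(model, auto_wrap_policy=wrap_policy)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(4, 128, 1024, device="cuda")
    loss = model(x).square().mean()   # dummy loss, for illustration only
    loss.backward()                    # gradients are reduce-scattered across ranks
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with torchrun (one process per GPU), this runs one sharded replica per device; YaFSDP targets the same training pattern while reducing the communication and memory-operation overhead described above.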

Quick Start & Requirements

  • Installation: Requires building a Docker image using docker/build.sh.
  • Prerequisites: NVIDIA PyTorch Docker image, patched 🤗 (Hugging Face) libraries (patches provided in the patches/ folder).
  • Resources: Benchmarks were conducted on clusters with A100 80 GB GPUs.
  • Examples: Training examples for causal language modeling pre-training (clm.md) and supervised fine-tuning (sft.md) are available.

Highlighted Details

  • Up to 20% faster pre-training for LLMs compared to PyTorch FSDP.
  • Demonstrated performance improvements across models from 7B to 70B parameters and 64 to 256 devices.
  • Achieves significant speedups (up to 26.60%) on larger models like Llama 3 70B.
  • Optimized for high memory pressure conditions.

Maintenance & Community

Developed and maintained by Yandex. Users can open GitHub issues for bugs or questions.

Licensing & Compatibility

The README does not explicitly state the license.

Limitations & Caveats

The project requires building a custom Docker image with patched libraries, indicating potential integration complexity and a dependency on specific library versions.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

Top 0.6% on SourcePulse · 6k stars
Triton kernels for efficient LLM training
Created 1 year ago · Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

Top 0.6% on SourcePulse · 46k stars
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago · Updated 14 hours ago