Sharded data parallelism framework for transformer-like neural networks
YaFSDP is a Sharded Data Parallelism framework designed for efficient training of transformer-like neural network architectures, particularly Large Language Models (LLMs). It targets researchers and engineers working with large-scale models who need to optimize training speed and memory usage, offering up to 20% faster pre-training and improved performance under high memory pressure compared to PyTorch's FSDP.
How It Works
YaFSDP is built to reduce communication and memory-operation overhead. The README does not detail the internal mechanisms, but the reported gains point to optimizations in parameter sharding, gradient communication, and memory management tailored to transformer architectures. The goal is to maximize GPU utilization and minimize synchronization bottlenecks during distributed training.
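For context, the sketch below shows what sharded data parallelism looks like with stock PyTorch FSDP; YaFSDP targets the same training workflow while reducing the communication and allocator overhead described above. This is a minimal illustration using standard PyTorch APIs, not YaFSDP's own interface, and the model size, hyperparameters, and launch details are illustrative assumptions.

```python
# Minimal sharded-data-parallel sketch with stock PyTorch FSDP (illustrative only;
# YaFSDP addresses the overhead of this pattern, it does not expose this exact API).
import functools

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def main():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=8 train.py`.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Toy stand-in for an LLM: a stack of standard transformer encoder layers.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=8,
    ).cuda()

    # Shard parameters, gradients, and optimizer state across all data-parallel
    # ranks, wrapping at transformer-block granularity.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={nn.TransformerEncoderLayer},
    )
    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        device_id=torch.cuda.current_device(),
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Dummy batch and loss for illustration; all-gathers happen inside the
    # forward pass and reduce-scatters inside the backward pass.
    x = torch.randn(2, 128, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```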
Quick Start & Requirements
Building a custom Docker image via docker/build.sh is required; patches for the underlying libraries are provided in the .patches/ folder. Example guides for causal language model training (clm.md) and supervised fine-tuning (sft.md) are available.
Highlighted Details
Maintenance & Community
Developed and maintained by Yandex. Users can open GitHub issues for bugs or questions.
Licensing & Compatibility
The README does not explicitly state the license.
Limitations & Caveats
The project requires building a custom Docker image with patched libraries, indicating potential integration complexity and a dependency on specific library versions.