YaFSDP by yandex

Sharded data parallelism framework for transformer-like neural networks

created 1 year ago · 972 stars

Top 38.7% on sourcepulse

Project Summary

YaFSDP is a Sharded Data Parallelism framework designed for efficient training of transformer-like neural network architectures, particularly Large Language Models (LLMs). It targets researchers and engineers working with large-scale models who need to optimize training speed and memory usage, offering up to 20% faster pre-training and improved performance under high memory pressure compared to PyTorch's FSDP.

How It Works

YaFSDP is built to reduce communication and memory operation overhead. While specific internal mechanisms are not detailed in the README, its performance gains suggest optimizations in parameter sharding, gradient communication, and memory management strategies tailored for transformer architectures. This approach aims to maximize GPU utilization and minimize synchronization bottlenecks during distributed training.
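The README doesn't expose internals, but the generic FSDP-style communication pattern that frameworks like YaFSDP optimize can be sketched in plain NumPy. This is a conceptual illustration, not YaFSDP code: each worker holds only one shard of the parameters, all-gathers the full set before compute, and reduce-scatters gradients so each worker updates only its own shard.

```python
# Conceptual sketch of sharded data parallelism (not YaFSDP's actual code).
import numpy as np

def shard(params, n_workers):
    """Split a flat parameter vector into one shard per worker."""
    return np.array_split(params, n_workers)

def all_gather(shards):
    """Each worker reconstructs the full parameters before forward/backward."""
    return np.concatenate(shards)

def reduce_scatter(per_worker_grads, n_workers):
    """Sum full-size gradients across workers, keep only the local shard."""
    summed = np.sum(per_worker_grads, axis=0)
    return np.array_split(summed, n_workers)

# 4 parameters sharded across 2 workers
params = np.array([1.0, 2.0, 3.0, 4.0])
shards = shard(params, 2)

full = all_gather(shards)               # every worker sees all parameters
grads = [full * 0.1, full * 0.3]        # each worker's local full-size gradient
grad_shards = reduce_scatter(grads, 2)  # each worker updates only its shard
```

Reducing the overhead of exactly these all-gather/reduce-scatter steps (and the memory traffic around them) is where frameworks in this space compete.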

Quick Start & Requirements

  • Installation: Requires building a Docker image using docker/build.sh.
  • Prerequisites: NVIDIA PyTorch Docker image and patched 🤗 (Hugging Face) libraries, provided in the patches/ folder.
  • Resources: Benchmarks were conducted on clusters with A100 80 GB GPUs.
  • Examples: Training examples for causal pre-training (clm.md) and supervised fine-tuning (sft.md) are available.
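The steps above boil down to a Docker build; a minimal setup sequence might look like the following (repository URL assumed from the project and owner names):

```shell
# Sketch of the setup flow described above; script path per the README.
git clone https://github.com/yandex/YaFSDP.git
cd YaFSDP
bash docker/build.sh  # builds the image from the NVIDIA PyTorch base, applying the patches/ libraries
```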

Highlighted Details

  • Up to 20% faster pre-training for LLMs compared to PyTorch FSDP.
  • Demonstrated performance improvements across models from 7B to 70B parameters and 64 to 256 devices.
  • Achieves significant speedups (up to 26.60%) on larger models like Llama 3 70B.
  • Optimized for high memory pressure conditions.

Maintenance & Community

Developed and maintained by Yandex. Users can open GitHub issues for bugs or questions.

Licensing & Compatibility

The README does not explicitly state the license.

Limitations & Caveats

The project requires building a custom Docker image with patched libraries, indicating potential integration complexity and a dependency on specific library versions.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 16 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 9 months ago · updated 23 hours ago