distributed-training-guide by LambdaLabsML

PyTorch guide for distributed training of large language models

Created 1 year ago
478 stars

Top 64.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a comprehensive guide to distributed PyTorch training, targeting ML engineers and researchers working with large neural networks and clusters. It offers best practices for scaling single-GPU training scripts to multi-GPU and multi-node setups, diagnosing common errors, and optimizing memory usage with techniques like FSDP and Tensor Parallelism.

How It Works

The guide progresses through sequential chapters, each building on the previous one. It starts with a basic single-GPU causal LLM training script and incrementally introduces distributed training concepts and their PyTorch implementations, including Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and Tensor Parallelism (TP). The approach emphasizes minimal, standard PyTorch for distributed logic, avoiding external libraries for core distributed operations.
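The data-parallel starting point described above can be sketched in plain Python: each rank trains on a strided shard of the dataset, which is the default partitioning scheme of PyTorch's DistributedSampler, and DDP then all-reduces gradients so replicas stay in sync. This is an illustrative sketch, not code from the guide; the helper name shard_indices is ours.

```python
def shard_indices(dataset_size: int, rank: int, world_size: int) -> list[int]:
    """Return the sample indices owned by `rank`, strided across the
    dataset the way torch.utils.data.DistributedSampler does by default."""
    return list(range(rank, dataset_size, world_size))

# With 2 GPUs and 10 samples, each rank sees half the data; gradients
# computed on each shard are averaged across ranks after backward().
shards = [shard_indices(10, r, 2) for r in range(2)]
```

Because every rank gets a disjoint shard of equal size, the global batch each step is world_size times the per-GPU batch, which is why learning-rate scaling usually accompanies this change.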

Quick Start & Requirements

  • Install: Clone the repository, create and activate a virtual environment, then install dependencies with pip install -r requirements.txt, install flash-attn separately with pip install flash-attn --no-build-isolation, and install wandb with pip install wandb.
  • Prerequisites: Python 3.x, PyTorch, Transformers, Datasets, flash-attn, and wandb for experiment tracking. A wandb login is required.
  • Setup Time: Minimal, primarily dependency installation.
  • Resources: Requires access to multi-GPU/multi-node clusters for full utilization.
  • Docs: NeurIPS 2024 presentation slides

Highlighted Details

  • Step-by-step progression from single-GPU to complex 2D parallelism (FSDP + TP).
  • Focus on diagnosing common cluster training errors and best practices for logging.
  • Demonstrates training models as large as Llama 405B.
  • Covers alternative PyTorch-based distributed frameworks.
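The 2D parallelism highlighted above composes two process groups over one set of GPUs: a tensor-parallel (TP) dimension that shards individual layers and an FSDP dimension that shards parameters and averages gradients across replicas. As a hedged sketch of the rank bookkeeping involved (the function name mesh_coords is ours; the guide builds the real mesh with PyTorch's device-mesh APIs), a global rank decomposes onto the mesh as:

```python
def mesh_coords(rank: int, tp_size: int) -> tuple[int, int]:
    """Map a global rank onto a 2D (data-parallel, tensor-parallel) mesh.
    Ranks sharing a dp index hold shards of the same layers (one TP group);
    ranks sharing a tp index hold the same layer shard and synchronize
    gradients across the data-parallel (FSDP) dimension."""
    return rank // tp_size, rank % tp_size

# 8 GPUs with TP degree 2 -> 4 data-parallel replicas of 2-way TP groups.
coords = [mesh_coords(r, tp_size=2) for r in range(8)]
```

Keeping TP groups within a node (fast NVLink) and FSDP groups across nodes is the usual placement, since TP communicates every layer while FSDP communicates less often.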

Maintenance & Community

The project is from Lambda Labs. Links to other Lambda ML projects are provided: ML Times, Text2Video, GPU Benchmark.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The guide focuses exclusively on PyTorch for distributed training and does not cover other frameworks like TensorFlow or JAX. While it aims for minimal dependencies, flash-attn is a significant external requirement for optimal performance. The guide's primary focus is on causal language models.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Nathan Lambert (Research Scientist at AI2), and 4 more.

large_language_model_training_playbook by huggingface

0%
479
Tips for training large language models
Created 2 years ago
Updated 2 years ago
Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm_training_handbook by huggingface

0%
511
Handbook for large language model training methodologies
Created 2 years ago
Updated 1 year ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 20 more.

alpa by alpa-projects

0.0%
3k
Auto-parallelization framework for large-scale neural network training and serving
Created 4 years ago
Updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Lewis Tunstall (Research Engineer at Hugging Face), and 13 more.

torchtitan by pytorch

0.7%
4k
PyTorch platform for generative AI model training research
Created 1 year ago
Updated 21 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 25 more.

gpt-neox by EleutherAI

0.2%
7k
Framework for training large-scale autoregressive language models
Created 4 years ago
Updated 2 days ago
Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech

0.1%
41k
AI system for large-scale parallel training
Created 3 years ago
Updated 15 hours ago