scaling-book by jax-ml

LLM scaling guide on TPUs

created 6 months ago
442 stars

Top 68.9% on sourcepulse

View on GitHub
Project Summary

This repository hosts "How To Scale Your Model," a blog-style textbook focused on demystifying the scaling of Large Language Models (LLMs) on Tensor Processing Units (TPUs). It targets researchers and engineers working with LLMs, providing insights into TPU architecture, large-scale LLM execution, and parallelism strategies to mitigate communication bottlenecks during training and inference.

How It Works

The book examines TPU hardware in depth and its implications for LLM training, then explains how LLMs are implemented at scale in practice, covering parallelism techniques such as data parallelism, tensor (model) parallelism, and pipeline parallelism. Its core value is practical guidance on choosing and combining these schemes so that communication overhead, a critical factor in efficient large-scale training and inference, does not become the bottleneck.
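
To make this concrete, here is a minimal sketch (illustrative, not taken from the book) of how JAX expresses two of these schemes with its sharding API, assuming exactly eight devices such as a small TPU slice:

# Minimal, hypothetical sketch: data parallelism plus tensor (model)
# parallelism for a single matmul, using JAX's sharding API.
# Assumes exactly 8 devices (e.g. a small TPU slice).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the devices in a 4x2 mesh: one axis for data parallelism,
# one for model (tensor) parallelism.
devices = np.array(jax.devices()).reshape(4, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

x = jnp.ones((32, 1024))    # activations: [batch, d_model]
w = jnp.ones((1024, 4096))  # weights:     [d_model, d_ff]

# Shard the batch across "data" and the weight's output features
# across "model"; the contracting d_model dimension stays replicated.
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # With these shardings every device holds the full contracting
    # dimension, so each computes its output block without any
    # communication. Sharding d_model across devices instead would
    # force XLA to insert collectives such as an all-reduce, which is
    # the communication cost the book teaches you to reason about.
    return x @ w

y = layer(x, w)  # output is sharded as P("data", "model")

The same mesh-and-sharding vocabulary extends to the other schemes the book covers, such as pipeline parallelism.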

Quick Start & Requirements

To build locally:

git clone https://github.com/jax-ml/scaling-book.git
cd scaling-book
bundle install
bundle exec jekyll serve

The site is then served at localhost:4000/scaling-book. macOS users may additionally need brew install imagemagick, brew install ruby, pip install jupyter, or git config safe.bareRepository all. Deploy to GitHub Pages with sh bin/deploy.

Highlighted Details

  • Comprehensive guide to LLM scaling on TPUs.
  • Focus on parallelism strategies to avoid communication bottlenecks.
  • Blog-style textbook format for accessibility.
  • Includes practical insights for training and inference.

Maintenance & Community

The book is authored by a team from Google DeepMind. Contributions are welcome via pull requests; comments and questions are encouraged on the website's Giscus-powered comment system or in GitHub Discussions. A Google Contributor License Agreement (CLA) is required for contributions.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify the project's license, which may complicate commercial adoption. It also lists no hardware requirements beyond TPUs for the concepts discussed and provides no benchmarks.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 6
  • Issues (30d): 0

Star History

190 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Nathan Lambert (AI Researcher at AI2), and 4 more.

Explore Similar Projects

large_language_model_training_playbook by huggingface

478 stars
Tips for training large language models
created 2 years ago, updated 2 years ago
Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm_training_handbook by huggingface

506 stars
Handbook for large language model training methodologies
created 2 years ago, updated 1 year ago
Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

cookbook by EleutherAI

809 stars
Deep learning resource for practical model work
created 1 year ago, updated 4 days ago