LLM scaling guide on TPUs
This repository hosts "How To Scale Your Model," a blog-style textbook focused on demystifying the scaling of Large Language Models (LLMs) on Tensor Processing Units (TPUs). It targets researchers and engineers working with LLMs, providing insights into TPU architecture, large-scale LLM execution, and parallelism strategies to mitigate communication bottlenecks during training and inference.
How It Works
The book walks through TPU hardware and its implications for LLM training. It explains how LLMs are actually run at scale, covering parallelism strategies such as data parallelism, model parallelism, and pipeline parallelism. Its core value is practical guidance on choosing and tuning these schemes so that communication overhead does not dominate, a critical factor in efficient large-scale training and inference.
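As a concrete illustration (a minimal sketch using JAX's jax.sharding API, not code taken from the book), data parallelism amounts to sharding the batch dimension across a device mesh while replicating the weights; the compiler then inserts the gradient all-reduce automatically:

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over all available devices (TPU cores, or CPU fallback).
mesh = Mesh(jax.devices(), axis_names=("data",))

# Shard the batch dimension across the "data" axis; replicate the weights.
x = jax.device_put(jnp.ones((32, 128)), NamedSharding(mesh, P("data", None)))
W = jax.device_put(jnp.ones((128, 8)), NamedSharding(mesh, P()))

@jax.jit
def loss(W, x):
    return jnp.mean(jnp.tanh(x @ W) ** 2)

# The compiler inserts the cross-device reduction (an all-reduce over the
# "data" axis) needed to combine the gradient contributions of each batch shard.
grads = jax.jit(jax.grad(loss))(W, x)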
Quick Start & Requirements
To build locally:
git clone https://github.com/jax-ml/scaling-book.git
cd scaling-book
bundle install
bundle exec jekyll serve
The site is then served at localhost:4000/scaling-book. macOS users may additionally require brew install imagemagick, pip install jupyter, brew install ruby, or git config safe.bareRepository all. Deployment to GitHub Pages is available via sh bin/deploy.
Maintenance & Community
The book is authored by a team from Google DeepMind. Contributions are welcomed via PRs, with comments and questions encouraged on the website's Giscus-powered comment system or GitHub discussions. A Google Contributor License Agreement (CLA) is required for contributions.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
As noted above, the unspecified license may deter commercial adoption. The README also names no hardware requirements beyond TPUs for the concepts discussed, and it provides no benchmarks.