scaling-book by jax-ml

LLM scaling guide on TPUs

Created 7 months ago
605 stars

Top 54.2% on SourcePulse

View on GitHub
Project Summary

This repository hosts "How To Scale Your Model," a blog-style textbook focused on demystifying the scaling of Large Language Models (LLMs) on Tensor Processing Units (TPUs). It targets researchers and engineers working with LLMs, providing insights into TPU architecture, large-scale LLM execution, and parallelism strategies to mitigate communication bottlenecks during training and inference.

How It Works

The book delves into the intricacies of TPU hardware and its implications for LLM training. It explains how LLMs are practically implemented at scale, covering various parallelism techniques such as data parallelism, model parallelism, and pipeline parallelism. The core advantage lies in its practical guidance on selecting and optimizing these schemes to avoid communication overhead, a critical factor in efficient large-scale model training.
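The book's material is built around JAX, so as a flavor of what these parallelism schemes look like in code, here is a minimal illustrative sketch (not taken from the book): a toy matrix multiply sharded across a hypothetical 2D device mesh, with one axis for data parallelism and one for tensor (model) parallelism. All names, shapes, and the 8-device mesh are assumptions for illustration.

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 2D mesh over 8 devices: "data" for batch sharding,
# "model" for tensor (weight) sharding. Adjust the shape to your hardware.
devices = np.array(jax.devices()).reshape(4, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

x = jnp.ones((1024, 512))   # activations: shard the batch dimension
w = jnp.ones((512, 2048))   # weights: shard the output-feature dimension
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # XLA inserts whatever collectives this layout implies; choosing
    # layouts that minimize those collectives is the book's core topic.
    return jnp.tanh(x @ w)

y = layer(x, w)   # result is tiled across both mesh axes

With this particular layout the contraction dimension is unsharded, so the forward pass needs no communication at all; picking shardings with that property is exactly the kind of reasoning the book walks through.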

Quick Start & Requirements

To build locally:

git clone https://github.com/jax-ml/scaling-book.git
cd scaling-book
bundle install
bundle exec jekyll serve

The site is served locally at localhost:4000/scaling-book. On macOS you may also need brew install imagemagick, brew install ruby, pip install jupyter, or git config safe.bareRepository all. To deploy to GitHub Pages, run sh bin/deploy.

Highlighted Details

  • Comprehensive guide to LLM scaling on TPUs.
  • Focus on parallelism strategies to avoid communication bottlenecks.
  • Blog-style textbook format for accessibility.
  • Includes practical insights for training and inference.

Maintenance & Community

The book is authored by a team from Google DeepMind. Contributions are welcomed via PRs, with comments and questions encouraged on the website's Giscus-powered comment system or GitHub discussions. A Google Contributor License Agreement (CLA) is required for contributions.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify the project's license, which may impact commercial adoption. There is no mention of specific hardware requirements beyond TPUs for the concepts discussed, nor are there explicit benchmarks provided within the README.

Health Check

Last Commit: 4 days ago
Responsiveness: 1+ week
Pull Requests (30d): 8
Issues (30d): 2
Star History: 110 stars in the last 30 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 1 more.

jaxformer by salesforce

Top 0.7% on SourcePulse
301 stars
JAX library for LLM training on TPUs
Created 3 years ago
Updated 1 year ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Elvis Saravia (Founder of DAIR.AI), and 2 more.

YaFSDP by yandex

Top 0.1% on SourcePulse
975 stars
Sharded data parallelism framework for transformer-like neural networks
Created 1 year ago
Updated 3 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

MobiLlama by mbzuai-oryx

Top 0% on SourcePulse
660 stars
Small language model for edge devices
Created 1 year ago
Updated 4 months ago