Open book for LLM/VLM training engineers
Top 3.5% on sourcepulse
This repository provides a comprehensive, open-source guide to Machine Learning Engineering, focusing on the training, fine-tuning, and inference of large language and multi-modal models. It targets LLM/VLM training engineers and operators seeking practical, step-by-step instructions and methodologies, and aims to accelerate learning and problem-solving through curated insights and actionable scripts derived from real-world large-scale model training experience.
How It Works
The book is structured into logical parts covering essential ML engineering domains: Insights, Hardware, Orchestration, Training, Inference, Development, and Miscellaneous Resources. It offers practical advice, comparison tables for hardware and networks, and includes custom tools for benchmarking and debugging. The approach emphasizes "brain dump" style knowledge sharing, providing copy-paste commands and direct solutions to common challenges encountered during large-scale model training.
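The measurements that the book's benchmarking tools automate come down to simple arithmetic over timed operations. A minimal sketch of that arithmetic (illustrative formulas and numbers only, not code from the repository's actual scripts):

```python
# Hedged sketch of the arithmetic behind accelerator and network benchmarks.
# The function names and example figures here are illustrative assumptions,
# not the repository's own tools.

def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) matmul: 2*m*n*k FLOPs / elapsed time."""
    return 2 * m * n * k / seconds / 1e12

def allreduce_busbw_gbps(payload_bytes: int, seconds: float, n_ranks: int) -> float:
    """Ring all-reduce bus bandwidth: algorithm bandwidth scaled by
    2*(n-1)/n, the convention used by the NCCL performance tests."""
    algbw = payload_bytes / seconds / 1e9  # GB/s moved by one rank
    return algbw * 2 * (n_ranks - 1) / n_ranks

# A 4096^3 matmul finishing in 1 ms corresponds to ~137 achieved TFLOPS.
print(round(achieved_tflops(4096, 4096, 4096, 1e-3), 1))  # → 137.4
# 4 GB all-reduced across 8 ranks in 0.5 s is 14 GB/s of bus bandwidth.
print(round(allreduce_busbw_gbps(4 * 10**9, 0.5, 8), 1))  # → 14.0
```

Comparing the achieved figure against an accelerator's advertised peak (or a measured maximum-achievable peak) is what turns a raw timing into a meaningful utilization number.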
Quick Start & Requirements
Highlighted Details
The repository bundles custom scripts, including ones for benchmarking network bandwidth (all_reduce_bench.py), testing inter-node connectivity (torch-distributed-gpu-test.py), and measuring actual accelerator TFLOPS (mamf-finder.py).
Maintenance & Community
The project is maintained by Stas Bekman, with contributions welcomed via Issues or Pull Requests. Updates are announced on Twitter. A community discussion forum is available on GitHub.
Licensing & Compatibility
Content is distributed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This license allows commercial use and linking, provided attribution is given and any derivative works are shared under the same license.
Limitations & Caveats
The content is presented as an "ongoing brain dump" and personal notes, so it may not follow a strictly curated or edited academic structure. While comprehensive, some scripts may only work as described in specific hardware and software environments.