ml-engineering by stas00

Open book for LLM/VLM training engineers

created 4 years ago
14,595 stars

Top 3.5% on sourcepulse

Project Summary

This repository is a comprehensive, open-source guide to Machine Learning Engineering, focused on the training, fine-tuning, and inference of large language and multi-modal models. It is aimed at LLM/VLM training engineers and operators who need practical, step-by-step instructions and methodologies. Its curated insights and actionable scripts come from real-world large-scale training runs, which accelerates both learning and problem-solving.

How It Works

The book is organized into parts covering the core ML engineering domains: Insights, Hardware, Orchestration, Training, Inference, Development, and Miscellaneous Resources. It offers practical advice, comparison tables for hardware and networks, and custom tools for benchmarking and debugging. The approach emphasizes "brain dump" style knowledge sharing: copy-paste commands and direct solutions to common problems encountered during large-scale model training.

Quick Start & Requirements

Highlighted Details

  • Detailed comparisons of high-end accelerators (TFLOPS, memory) and network performance (inter- and intra-node speeds).
  • Custom tools for benchmarking network throughput (all_reduce_bench.py), testing inter-node connectivity (torch-distributed-gpu-test.py), and measuring actual accelerator TFLOPS (mamf-finder.py); minimal illustrative sketches of the first and last measurements follow this list.
  • Practical guides and "copy-paste" solutions for debugging PyTorch applications and using SLURM.
  • Chronicles of LLM/VLM training experiences, including BLOOM-176B and IDEFICS-80B.
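
To give a flavor of what the network benchmark measures, below is a minimal sketch of an all-reduce bandwidth test in PyTorch. It is a hypothetical stand-in, not the repository's all_reduce_bench.py; the bus-bandwidth factor 2*(n-1)/n assumes a ring all-reduce, and the payload size is an arbitrary choice.

    # Minimal all-reduce bandwidth sketch (hypothetical stand-in, not the
    # repository's all_reduce_bench.py). Launch with e.g.:
    #   torchrun --nproc_per_node=8 allreduce_sketch.py
    import os
    import time
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group("nccl")
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

        size_gb = 4                                    # fp32 payload per rank
        tensor = torch.rand(size_gb * 2**30 // 4, device="cuda")

        for _ in range(5):                             # warm-up: NCCL communicator setup
            dist.all_reduce(tensor)
        torch.cuda.synchronize()

        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters

        # bus bandwidth for a ring all-reduce: each rank moves payload * 2*(n-1)/n
        n = dist.get_world_size()
        busbw = size_gb * 2 * (n - 1) / n / elapsed
        if dist.get_rank() == 0:
            print(f"avg all_reduce: {elapsed * 1e3:.1f} ms, busbw ~{busbw:.1f} GB/s")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()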

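In the same spirit, achievable matmul TFLOPS on a single accelerator can be estimated by timing large matrix multiplications, as sketched below. This is a simplified stand-in for what mamf-finder.py does more thoroughly (for example, sweeping matrix shapes); the shapes and dtype here are arbitrary choices.

    # Minimal sketch of measuring achievable matmul TFLOPS on one accelerator
    # (hypothetical; mamf-finder.py in the repository is more thorough).
    import time
    import torch

    def measure_tflops(m=8192, n=8192, k=8192, dtype=torch.bfloat16, iters=50):
        a = torch.randn(m, k, device="cuda", dtype=dtype)
        b = torch.randn(k, n, device="cuda", dtype=dtype)

        for _ in range(10):                  # warm-up: kernel selection, clocks
            torch.matmul(a, b)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters

        return 2 * m * n * k / elapsed / 1e12   # a matmul performs 2*m*n*k FLOPs

    if __name__ == "__main__":
        print(f"~{measure_tflops():.1f} TFLOPS (bf16 matmul)")
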
Maintenance & Community

The project is maintained by Stas Bekman, with contributions welcomed via Issues or Pull Requests. Updates are announced on Twitter. A community discussion forum is available on GitHub.

Licensing & Compatibility

Content is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, which permits commercial use and redistribution provided attribution is given and any derivative works are shared under the same license.

Limitations & Caveats

The content is presented as an "ongoing brain dump" and personal notes rather than a strictly curated or edited textbook. While comprehensive, many of the scripts and recipes assume particular hardware and software environments and may need adaptation elsewhere.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 0

Star History

  • 1,070 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm_training_handbook by huggingface

  • Handbook for large language model training methodologies
  • 506 stars
  • created 2 years ago, updated 1 year ago