large_language_model_training_playbook by huggingface

Tips for training large language models

created 2 years ago
478 stars


Project Summary

This playbook provides practical implementation tips, tricks, and resources for training large language models (LLMs). It targets engineers and researchers involved in LLM development, offering guidance on architecture, parallelism, scaling, precision, hyperparameter tuning, and stability.

How It Works

The playbook is an open collection of curated advice and resources, complementing a more detailed handbook. It addresses common challenges in LLM training, such as selecting model architectures, parallelism strategies, and tensor precision (FP32, FP16, BF16), alongside hyperparameter tuning, batch size optimization, and stability management.
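The precision trade-offs mentioned above can be illustrated with a small stdlib-only sketch. The bit widths below are the standard IEEE 754 / bfloat16 layouts (general background, not values taken from the playbook itself):

```python
# Illustrative comparison of the tensor formats used in LLM training.
# Each format is (exponent bits, mantissa bits); layouts are standard.
FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}

def max_value(exp_bits: int, man_bits: int) -> float:
    """Largest finite value: (2 - 2**-mantissa) * 2**(max unbiased exponent)."""
    return (2 - 2 ** -man_bits) * 2.0 ** (2 ** (exp_bits - 1) - 1)

def epsilon(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value (precision)."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name}: max finite value = {max_value(e, m):.3e}, eps = {epsilon(m):.1e}")
```

FP16 overflows at 65504, which is why pure FP16 training typically needs loss scaling, while BF16 keeps FP32's dynamic range (~3.4e38) at the cost of a coarser epsilon; that range-versus-precision trade is the core of the FP16/BF16 decision the playbook covers.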

Quick Start & Requirements

This resource is a curated collection of information; there is no installation or execution step. It assumes a foundational understanding of LLM training concepts.

Highlighted Details

  • Covers critical decisions like model architecture, parallelism strategy, and model size.
  • Details tensor precision trade-offs (FP32, FP16, BF16) and mixed-precision techniques.
  • Provides guidance on hyperparameter selection, learning rate schedules, and batch size.
  • Offers strategies for maximizing throughput and managing training instabilities.
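To make the learning-rate-schedule bullet concrete, here is a minimal sketch of the warmup-plus-cosine-decay schedule commonly used in LLM pretraining. The function name and default hyperparameters are illustrative assumptions, not values from the playbook:

```python
import math

def lr_schedule(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
                warmup_steps: int = 2_000, total_steps: int = 100_000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from near zero to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The schedule peaks at the end of warmup and anneals smoothly afterwards.
print(lr_schedule(0), lr_schedule(2_000), lr_schedule(100_000))
```

Warmup of this kind is one common way to avoid the early-training instabilities the playbook discusses, since large updates on a freshly initialized model are a frequent source of loss spikes.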

Maintenance & Community

This is an open collection, with contributions welcomed. Further details on community engagement or specific contributors are not provided in the README.

Licensing & Compatibility

The license is not specified in the provided README.

Limitations & Caveats

The playbook is a companion to a more detailed handbook and may not contain exhaustive implementation scripts or code. Specific technical requirements or compatibility notes are not detailed.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm_training_handbook by huggingface — Handbook for large language model training methodologies. 506 stars; created 2 years ago, updated 1 year ago.