nanotron by huggingface

Minimalistic library for large language model pretraining

created 1 year ago
2,078 stars

Top 22.0% on sourcepulse

Project Summary

Nanotron is a library for pretraining transformer models, offering a simple, performant, and scalable API for custom datasets. It targets researchers and engineers building large language models, enabling efficient training through advanced parallelism techniques.

How It Works

Nanotron implements 3D parallelism (Data Parallelism, Tensor Parallelism, and Pipeline Parallelism) to distribute model training across multiple GPUs and nodes. It supports expert parallelism for Mixture-of-Experts (MoE) models and includes optimized pipeline schedules (AFAB, i.e. all-forward-all-backward, and 1F1B, i.e. one-forward-one-backward). The library exposes explicit APIs for TP and PP, which eases debugging and customization, and pairs them with a ZeRO-1 optimizer for memory efficiency and FP32 gradient accumulation for numerically stable training.
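
The snippet below illustrates how such a 3D layout maps a flat GPU rank to data-, pipeline-, and tensor-parallel coordinates. It is a framework-agnostic Python sketch with made-up example sizes (dp=2, pp=2, tp=2), not nanotron's internal process-group code.

    # Illustrative only: decompose a global rank into (dp_rank, pp_rank, tp_rank).
    def rank_to_3d(rank: int, dp: int, pp: int, tp: int) -> tuple[int, int, int]:
        assert rank < dp * pp * tp, "rank out of range for this layout"
        tp_rank = rank % tp              # fastest-varying: tensor-parallel peers share layer shards
        pp_rank = (rank // tp) % pp      # next: pipeline stage
        dp_rank = rank // (tp * pp)      # slowest: data-parallel replica
        return dp_rank, pp_rank, tp_rank

    # Example: 8 GPUs split as dp=2, pp=2, tp=2, so each model replica spans 4 GPUs.
    for r in range(8):
        print(r, rank_to_3d(r, dp=2, pp=2, tp=2))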

Quick Start & Requirements

  • Installation (a quick post-install sanity check follows this list):
    # create and activate a Python 3.11 virtual environment
    uv venv nanotron --python 3.11 && source nanotron/bin/activate
    # install PyTorch built against CUDA 12.4
    uv pip install torch --index-url https://download.pytorch.org/whl/cu124
    # install nanotron in editable mode (run from a clone of the repo)
    uv pip install -e .
    # install training dependencies, including FlashAttention
    uv pip install datasets transformers "datatrove[io]" numba wandb ninja triton "flash-attn>=2.5.0" --no-build-isolation
    # log in to the Hugging Face Hub and Weights & Biases, then confirm Git LFS is installed
    huggingface-cli login
    wandb login
    git-lfs --version
    
  • Prerequisites: Python 3.11, PyTorch with CUDA 12.4, Git LFS.
  • Resources: Requires multiple GPUs (e.g., 8 x H100s for the tiny Llama example).
  • Docs: Ultrascale Playbook, Your First Training
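
As a quick post-install check (a minimal Python sketch; it assumes the steps above completed in the active environment and that a CUDA GPU is visible):

    # Sanity-check the environment created by the installation steps above.
    import torch
    print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())

    import flash_attn  # FlashAttention kernels installed above
    print("flash-attn:", flash_attn.__version__)

    import nanotron  # the editable install from `uv pip install -e .`
    print("nanotron imported from:", nanotron.__file__)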

Highlighted Details

  • Supports 3D parallelism (DP+TP+PP) and expert parallelism for MoEs.
  • Includes AFAB and 1F1B schedules for pipeline parallelism.
  • Features a ZeRO-1 optimizer, FP32 gradient accumulation (sketched after this list), and parameter tying/sharding.
  • Offers spectral µTransfer parametrization for scaling neural networks.
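
To make the FP32 gradient accumulation above concrete: gradients produced in bf16 are added into FP32 buffers across micro-batches, so repeated additions do not lose precision. The sketch below is a generic PyTorch illustration of that idea with a toy model, not nanotron's actual accumulator.

    # Toy illustration of FP32 gradient accumulation over a bf16 model.
    import torch

    model = torch.nn.Linear(16, 16, dtype=torch.bfloat16)
    # One FP32 accumulation buffer per parameter.
    fp32_grads = [torch.zeros(p.shape, dtype=torch.float32) for p in model.parameters()]

    for _ in range(4):  # four micro-batches
        model.zero_grad()
        x = torch.randn(4, 16, dtype=torch.bfloat16)
        model(x).float().pow(2).mean().backward()
        # Accumulate this micro-batch's bf16 gradients into the FP32 buffers.
        for buf, p in zip(fp32_grads, model.parameters()):
            buf += p.grad.float()

    # An optimizer step would then consume fp32_grads (e.g. with FP32 master weights).
    print([round(g.norm().item(), 3) for g in fp32_grads])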

Maintenance & Community

  • Actively developed by Hugging Face.
  • Examples cover custom dataloaders, Mamba, MoE, and µTransfer.
  • Roadmap includes FP8 training, ZeRO-3, torch.compile, and ring attention.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • FP8 training and ZeRO-3 (FSDP) are on the roadmap, not yet implemented.
  • torch.compile support is also planned for future releases.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 2

Star History

  • 271 stars in the last 90 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309
Framework for large-scale transformer optimization
created 3 years ago
updated 2 years ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Zhiqiang Xie (Author of SGLang).

veScale by volcengine

0.1%
839
PyTorch-native framework for LLM training
created 1 year ago
updated 3 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0%
402
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Zhuohan Li (Author of vLLM), and 6 more.

torchtitan by pytorch

0.9%
4k
PyTorch platform for generative AI model training research
created 1 year ago
updated 22 hours ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 6 more.

gpt-neox by EleutherAI

0.1%
7k
Framework for training large-scale autoregressive language models
created 4 years ago
updated 1 week ago