Language model training speedrun on 8x H100 GPUs
Top 16.6% on sourcepulse
This repository focuses on optimizing the training speed of language models, specifically targeting the NanoGPT architecture. It's designed for researchers and engineers interested in pushing the boundaries of efficient LLM training, offering a competitive benchmark for achieving low validation loss on the FineWeb dataset with minimal computational resources and time.
How It Works
The project implements a suite of advanced techniques to drastically reduce training time. Key innovations include a modernized architecture with rotary embeddings, QK-Norm, and ReLU², the Muon optimizer for improved sample efficiency and a lower memory footprint, and further modifications such as untied embedding and head weights, FP8 matmuls, skip connections, and FlexAttention with sliding-window patterns. Together, these elements aim to accelerate convergence and reduce computational overhead.
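Muon's core idea is to apply SGD-style momentum and then approximately orthogonalize each 2D weight update with a Newton-Schulz iteration before taking the step. The snippet below is a minimal sketch of that idea, assuming the coefficients from the public Muon reference implementation; newton_schulz_orthogonalize and muon_step are illustrative simplifications, not the repository's optimizer code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately map the update matrix G onto a semi-orthogonal matrix
    # using a quintic Newton-Schulz iteration, run in bfloat16 for speed.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)              # scale so the largest singular value is <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = A @ X
        X = a * X + b * B + c * (A @ B)
    return (X.T if transposed else X).to(G.dtype)

def muon_step(weight: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    # One simplified Muon update for a single 2D weight:
    # momentum accumulation, orthogonalization, then the parameter step.
    momentum_buf.mul_(beta).add_(grad)
    weight.data.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```

Orthogonalizing the update equalizes its singular values, which is what lets Muon take aggressive steps on weight matrices without the memory overhead of Adam-style second-moment statistics.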
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, followed by pip install --pre torch==2.7.0.dev20250310+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade.
Run python data/cached_fineweb10B.py 8 to download the first 800M tokens.
Start training with ./run.sh.
torch.compile adds ~5 minutes of latency on the first run.
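That latency is the one-time cost of graph capture and kernel compilation on the first forward pass; later calls reuse the compiled kernels. The sketch below shows where that cost is paid, using a stand-in nn.TransformerEncoderLayer (not the repo's GPT model) and assuming a CUDA device is available.

```python
import torch
import torch.nn as nn

# Stand-in model for illustration only; the speedrun compiles its own GPT implementation.
model = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).cuda()
compiled_model = torch.compile(model)

x = torch.randn(8, 1024, 768, device="cuda")
out = compiled_model(x)   # first call: graph capture + kernel compilation (the one-time cost)
out = compiled_model(x)   # later calls reuse the compiled kernels and run at full speed
```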
Highlighted Details
Maintenance & Community
The project is actively maintained by contributors like @KellerJordan, @bozavlado, and @Grad62304977, with a clear history of record progression.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The primary focus is on extreme optimization for specific hardware (8x H100 GPUs) and a particular dataset (FineWeb). Some techniques, like logit softcapping, may not scale well to much larger models or different architectures. The "speedrun" nature implies a focus on achieving a target metric quickly, potentially at the expense of broader generalization or robustness.