modded-nanogpt by KellerJordan

Language model training speedrun on 8x H100 GPUs

created 1 year ago
2,954 stars

Top 16.6% on sourcepulse

View on GitHub
Project Summary

This repository focuses on optimizing the training speed of language models, specifically targeting the NanoGPT architecture. It's designed for researchers and engineers interested in pushing the boundaries of efficient LLM training, offering a competitive benchmark for achieving low validation loss on the FineWeb dataset with minimal computational resources and time.

How It Works

The project combines a suite of techniques to drastically reduce training time. Key innovations include a modernized architecture with rotary embeddings, QK-Norm, and ReLU² activations; the Muon optimizer, which improves sample efficiency while using less memory than Adam; and further modifications such as untied embedding and output head weights, FP8 matmuls, skip connections, and FlexAttention with sliding-window attention patterns. Together, these accelerate convergence and cut computational overhead.
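As a minimal sketch of two of the architectural ingredients, QK-Norm and the ReLU² activation, here is a plain-PyTorch illustration. Module layout, single-head attention, and norm placement are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn.functional as F
from torch import nn


def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # ReLU² activation: ReLU followed by squaring, used in the MLP in place of GELU.
    return F.relu(x).square()


class QKNormSelfAttention(nn.Module):
    # Illustrative single-head causal self-attention with QK-Norm:
    # queries and keys are unit-normalized along the feature dimension
    # before the dot product, which keeps attention logits well scaled.
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        q = F.normalize(self.q_proj(x), dim=-1)  # QK-Norm on queries
        k = F.normalize(self.k_proj(x), dim=-1)  # QK-Norm on keys
        v = self.v_proj(x)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y)
```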

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install --pre torch==2.7.0.dev20250310+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade.
  • Data: Run python data/cached_fineweb10B.py 8 to download the first 800M tokens.
  • Run: Execute ./run.sh.
  • Prerequisites: NVIDIA H100 GPUs (8 recommended for record times), CUDA 12.6, Python 3.x. torch.compile adds ~5 minutes of latency on first run.
  • Docker: A Dockerfile is provided for environment standardization.
  • Docs: "World record history" and "Speedrun track 2: GPT-2 Medium".

Highlighted Details

  • Achieves 3.28 validation loss on FineWeb in 3 minutes using 8x H100 GPUs, a significant improvement over the 45-minute baseline.
  • Utilizes the Muon optimizer, offering lower memory usage and ~1.5x better sample efficiency than Adam.
  • Incorporates FlexAttention with long-short sliding window patterns for efficient handling of longer contexts (a minimal sliding-window mask sketch follows this list).
  • Features a competitive "speedrun" format, encouraging rapid iteration and record-breaking in LLM training efficiency.
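
To make the sliding-window idea concrete, here is a rough sketch using PyTorch's torch.nn.attention.flex_attention API. The window size, tensor shapes, and dtype are assumptions for illustration; the repository's actual long-short windowing scheme is more involved.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, T, D = 1, 8, 1024, 64   # batch, heads, sequence length, head dim (illustrative)
WINDOW = 256                  # sliding-window size (assumed value, not the repo's setting)


def sliding_window_causal(b, h, q_idx, kv_idx):
    # Each query attends only to keys that are causal and at most WINDOW tokens behind it.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)


# Block-sparse mask: FlexAttention skips key/value blocks that are fully masked out.
block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=T, KV_LEN=T)

q, k, v = (torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, T, D)
```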

Maintenance & Community

The project is actively maintained by contributors like @KellerJordan, @bozavlado, and @Grad62304977, with a clear history of record progression.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The primary focus is on extreme optimization for specific hardware (8x H100 GPUs) and a particular dataset (FineWeb). Some techniques, like logit softcapping, may not scale well to much larger models or different architectures. The "speedrun" nature implies a focus on achieving a target metric quickly, potentially at the expense of broader generalization or robustness.
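
For context, logit softcapping is typically a one-line tanh squash applied to the output logits before the loss. The sketch below is a generic version; the cap value is chosen for illustration rather than taken from this repository.

```python
import torch


def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bound output logits to (-cap, cap) via tanh; the cap of 30 is illustrative.
    return cap * torch.tanh(logits / cap)
```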

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 7
  • Issues (30d): 3

Star History

467 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

Explore Similar Projects

recurrent-pretraining by seal-rg

0.1% · 806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago · updated 2 weeks ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0% · 402 stars
Lightweight training framework for model pre-training
created 1 year ago · updated 1 week ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2% · 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago · updated 1 day ago