Language model training speedrun on 8x H100 GPUs
Top 16.6% on sourcepulse
This repository focuses on optimizing the training speed of language models, specifically targeting the NanoGPT architecture. It's designed for researchers and engineers interested in pushing the boundaries of efficient LLM training, offering a competitive benchmark for achieving low validation loss on the FineWeb dataset with minimal computational resources and time.
How It Works
The project implements a suite of advanced techniques to drastically reduce training time. Key innovations include a modernized architecture with rotary embeddings, QK-Norm, and ReLU², the Muon optimizer for improved sample efficiency and a lower memory footprint, and further modifications such as untied embedding and head weights, FP8 matmuls, skip connections, and FlexAttention with sliding-window patterns. Together, these elements aim to accelerate convergence and reduce computational overhead.
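Muon's core idea is to apply SGD-style momentum and then approximately orthogonalize each 2D weight update with a Newton-Schulz iteration before taking the step. The snippet below is a minimal sketch of that idea, assuming the coefficients from the public Muon reference implementation; newton_schulz_orthogonalize and muon_step are illustrative simplifications, not the repository's optimizer code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately map the update matrix G onto a semi-orthogonal matrix
    # using a quintic Newton-Schulz iteration, run in bfloat16 for speed.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)              # scale so the largest singular value is <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = A @ X
        X = a * X + b * B + c * (A @ B)
    return (X.T if transposed else X).to(G.dtype)

def muon_step(weight: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    # One simplified Muon update for a single 2D weight:
    # momentum accumulation, orthogonalization, then the parameter step.
    momentum_buf.mul_(beta).add_(grad)
    weight.data.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```

Orthogonalizing the update equalizes its singular values, which is what lets Muon take aggressive steps on weight matrices without the memory overhead of Adam-style second-moment statistics.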
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, followed by pip install --pre torch==2.7.0.dev20250310+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade.
Run python data/cached_fineweb10B.py 8 to download the first 800M tokens.
Start training with ./run.sh.
torch.compile adds ~5 minutes of latency on the first run.
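That latency is the one-time cost of graph capture and kernel compilation on the first forward pass; later calls reuse the compiled kernels. The sketch below shows where that cost is paid, using a stand-in nn.TransformerEncoderLayer (not the repo's GPT model) and assuming a CUDA device is available.

```python
import torch
import torch.nn as nn

# Stand-in model for illustration only; the speedrun compiles its own GPT implementation.
model = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).cuda()
compiled_model = torch.compile(model)

x = torch.randn(8, 1024, 768, device="cuda")
out = compiled_model(x)   # first call: graph capture + kernel compilation (the one-time cost)
out = compiled_model(x)   # later calls reuse the compiled kernels and run at full speed
```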
Highlighted Details
Maintenance & Community
The project is actively maintained by contributors like @KellerJordan, @bozavlado, and @Grad62304977, with a clear history of record progression.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The primary focus is on extreme optimization for specific hardware (8x H100 GPUs) and a particular dataset (FineWeb). Some techniques, like logit softcapping, may not scale well to much larger models or different architectures. The "speedrun" nature implies a focus on achieving a target metric quickly, potentially at the expense of broader generalization or robustness.