modded-nanogpt by KellerJordan

Language model training speedrun on 8x H100 GPUs

Created 1 year ago
3,118 stars

Top 15.4% on SourcePulse

View on GitHub
Project Summary

This repository focuses on optimizing the training speed of language models, specifically targeting the NanoGPT architecture. It's designed for researchers and engineers interested in pushing the boundaries of efficient LLM training, offering a competitive benchmark for achieving low validation loss on the FineWeb dataset with minimal computational resources and time.

How It Works

The project combines a suite of techniques to drastically reduce training time. Key components include a modernized architecture (rotary embeddings, QK-Norm, ReLU² activations, and untied embedding and output head weights), the Muon optimizer for better sample efficiency and a lower memory footprint than Adam, and performance-oriented changes such as FP8 matmuls, additional skip connections, and FlexAttention with long-short sliding-window patterns. Together these accelerate convergence and cut computational overhead.
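
The sliding-window attention component can be illustrated with PyTorch's FlexAttention API. The sketch below is a minimal, hypothetical example: the window size, tensor shapes, and single fixed window are placeholders rather than the repo's actual configuration, which mixes long and short windows across layers.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Placeholder sizes; the real training run uses its own batch/sequence config.
B, H, S, D = 1, 8, 1024, 64
WINDOW = 256  # assumed window size, not the repo's value

def sliding_window_causal(b, h, q_idx, kv_idx):
    # A query may attend to a key only if the key is in the past (causal)
    # and within WINDOW positions of it (sliding window).
    return (kv_idx <= q_idx) & (q_idx - kv_idx <= WINDOW)

# Precompute a block-sparse mask; fully masked blocks are skipped entirely.
block_mask = create_block_mask(sliding_window_causal, B=None, H=None,
                               Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# torch.compile(flex_attention) is recommended for speed; eager mode also works.
out = flex_attention(q, k, v, block_mask=block_mask)
```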

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install --pre torch==2.7.0.dev20250310+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade.
  • Data: Run python data/cached_fineweb10B.py 8 to download the first 800M tokens.
  • Run: Execute ./run.sh.
  • Prerequisites: NVIDIA H100 GPUs (8 recommended for record times), CUDA 12.6, Python 3.x. torch.compile adds ~5 minutes of latency on first run.
  • Docker: A Dockerfile is provided for environment standardization.
  • Docs: the world record history and the "Speedrun track 2: GPT-2 Medium" track.

Highlighted Details

  • Achieves 3.28 validation loss on FineWeb in 3 minutes using 8x H100 GPUs, a significant improvement over the 45-minute baseline.
  • Utilizes the Muon optimizer (orthogonalized momentum; see the sketch after this list), offering lower memory usage and ~1.5x better sample efficiency than Adam.
  • Incorporates FlexAttention with long-short sliding window patterns for efficient handling of longer contexts.
  • Features a competitive "speedrun" format, encouraging rapid iteration and record-breaking in LLM training efficiency.
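
As a rough illustration of what Muon does, the sketch below accumulates momentum per 2D weight matrix and approximately orthogonalizes it with a quintic Newton-Schulz iteration before applying the update. This is a simplified sketch, not the repo's tuned, distributed implementation; the iteration coefficients follow the published Muon write-up, and the learning rate, momentum, and scaling here are placeholders.

```python
import torch

def newtonschulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (push its singular values toward 1)
    # with a quintic Newton-Schulz iteration run in bfloat16.
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon write-up (assumed)
    X = G.bfloat16()
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    X = X / (X.norm() + eps)  # scale so the spectral norm is at most ~1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # One simplified Muon-style update for a 2D weight matrix:
    # momentum accumulation, orthogonalization, then a shape-dependent scale.
    momentum_buf.mul_(beta).add_(grad)
    update = newtonschulz5(momentum_buf)
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```

In the repo, Muon is applied to the 2D hidden weight matrices, while embeddings and other parameters are handled by an Adam-style optimizer.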

Maintenance & Community

The project is actively maintained by contributors like @KellerJordan, @bozavlado, and @Grad62304977, with a clear history of record progression.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The primary focus is on extreme optimization for specific hardware (8x H100 GPUs) and a particular dataset (FineWeb). Some techniques, like logit softcapping, may not scale well to much larger models or different architectures. The "speedrun" nature implies a focus on achieving a target metric quickly, potentially at the expense of broader generalization or robustness.
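
For reference, logit softcapping (mentioned above) is typically a tanh squashing of the output logits into a bounded range. A minimal sketch, with an illustrative cap value rather than the repo's:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap); the cap value is illustrative.
    return cap * torch.tanh(logits / cap)
```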

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 12
  • Issues (30d): 1
  • Star history: 91 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface
Top 0.3% · 9k stars
PyTorch training helper for distributed execution
Created 4 years ago · Updated 1 day ago

Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 36 more.

unsloth by unslothai
Top 0.6% · 46k stars
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago · Updated 14 hours ago