how-to-train-your-gpt by raiyanyahya

Build a modern LLM from scratch, line by line

Created 2 months ago

2,289 stars

Top 19.1% on SourcePulse


1 Expert Loves This Project

Li Jiang

Coauthor of AutoGen; Engineer at Microsoft

Project Summary

An interactive textbook for building a modern Large Language Model (LLM) from scratch. It is aimed at Python developers and students and walks through the Transformer architecture line by line, so that readers can construct and train their own GPT model and understand the underlying computation instead of only calling an API.

How It Works

This project is a 12-chapter, ~3,600-line interactive guide in which users write every component of a GPT model themselves. Each concept is taught three ways: an analogy simple enough for a five-year-old, a worked numerical example, and heavily annotated code. The implementation is a decoder-only Transformer that uses Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU, the AdamW optimizer, Byte Pair Encoding (BPE) tokenization, weight tying, and mixed-precision training. The goal is a working understanding of the internals, which API-based tutorials tend to skip and academic papers tend to compress.
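As a rough illustration of two of the components named above, the following sketch computes RMSNorm and a SwiGLU feed-forward step in plain Python on tiny vectors. It is not taken from the repository: the function names, the `eps` value, and the toy weight matrices are all assumptions chosen for readability, and a real implementation would use PyTorch tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then apply a per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy weights (hypothetical values): model width 2, hidden width 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w_up = [[0.3, 0.7], [-0.5, 0.2], [0.1, 0.1]]
w_down = [[0.2, -0.1, 0.3], [0.0, 0.5, -0.2]]

x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])  # RMS of the result is close to 1
out = swiglu(normed, w_gate, w_up, w_down)
print("normalized:", normed)
print("feed-forward output:", out)
```

Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales, which is why the sketch needs a single pass over the vector.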

Quick Start & Requirements

Clone: git clone https://github.com/raiyanyahya/how-to-train-your-gpt.git
Environment: python -m venv gpt_env && source gpt_env/bin/activate (or Windows equivalent).
Install: pip install torch tiktoken datasets numpy matplotlib
GPU Check (Optional): python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Prerequisites: Basic Python proficiency (variables, functions, classes). No prior ML, calculus, or linear algebra knowledge is assumed; these are taught contextually. GPU acceleration is highly recommended for training (~2 hours on RTX 3090); CPU-only execution is approximately 10-50x slower.
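Inside a script, the same GPU check can drive device selection. The helper below is a minimal sketch, not code from the repository; `pick_device` is a hypothetical name, and it falls back to the CPU when PyTorch is not installed or no CUDA device is visible.

```python
def pick_device():
    """Return 'cuda' if PyTorch sees a CUDA GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so there is nothing to accelerate.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Training would run on: {device}")
```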

Highlighted Details

Implements a decoder-only Transformer built from publicly documented techniques, following the design of models such as LLaMA 3, Mistral, and Qwen 2.5.
Builds core components including a BPE tokenizer (~60 lines), embeddings (~30 lines), RoPE (~70 lines), Multi-Head Attention (~120 lines), Transformer Block (~50 lines), a 124M parameter GPT model (~200 lines), a custom Training Pipeline (~250 lines), and an Inference Engine (~80 lines).
Explains advanced concepts such as the variance argument behind 1/√d_k, RoPE's relative position encoding via rotation, the stability benefits of pre-norm configurations, and detailed gradient flow during backpropagation.
Provides a runnable main.py script for end-to-end training and inference.
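Two of the concepts listed above can be checked numerically in a few lines. The sketch below is an illustration written for this summary, not the project's code; all names are hypothetical, and the RoPE part rotates a single 2-D pair at one frequency, whereas a full implementation rotates many pairs at different frequencies. The first half shows that the dot product of two random vectors with unit-variance entries has variance close to d_k, and that dividing by √d_k brings it back to about 1. The second half shows that after rotating queries and keys by their positions, the attention score depends only on the position offset.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_variance(d_k, scaled, trials=2000):
    """Estimate the variance of q.k for random unit-variance q and k."""
    scores = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        s = dot(q, k)
        scores.append(s / math.sqrt(d_k) if scaled else s)
    return statistics.pvariance(scores)

raw_var = score_variance(64, scaled=False)    # grows with d_k
scaled_var = score_variance(64, scaled=True)  # stays near 1
print(f"variance without scaling: {raw_var:.1f}")
print(f"variance with 1/sqrt(d_k): {scaled_var:.2f}")

def rotate_pair(vec, position, theta=1.0):
    """Rotate a 2-D (x, y) pair by position * theta radians."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

q, k = (1.0, 2.0), (0.5, -1.5)
# Same offset (2) at different absolute positions gives the same score.
score_near = dot(rotate_pair(q, 3), rotate_pair(k, 1))
score_far = dot(rotate_pair(q, 12), rotate_pair(k, 10))
# A different offset (1) gives a different score.
score_other = dot(rotate_pair(q, 3), rotate_pair(k, 2))
print(score_near, score_far, score_other)
```

Keeping the score variance near 1 matters because softmax saturates on large inputs, which shrinks gradients; the rotation property is what lets RoPE encode relative position without a separate position table.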

Maintenance & Community

The repository welcomes issues and pull requests. Specific community channels (e.g., Discord, Slack), roadmap details, or notable contributor/sponsorship information are not detailed in the README.

Licensing & Compatibility

The open-source license for this repository is not explicitly stated in the provided README. Potential users should verify licensing terms before integration into commercial or closed-source projects.

Limitations & Caveats

This project is primarily an educational tool for understanding LLM internals, not a production-ready framework. CPU-only training is roughly 10-50x slower than training on a GPU. The implemented architecture is based on publicly disclosed techniques; proprietary aspects of models like GPT-4 and Claude remain undisclosed.

Health Check

Last Commit

2 weeks ago

Responsiveness

Inactive

Star History

61 stars in the last 30 days