how-to-train-your-gpt  by raiyanyahya

Build a modern LLM from scratch, line by line

Created 3 weeks ago

1,803 stars

Top 23.4% on SourcePulse

Project Summary

Build a modern Large Language Model (LLM) from scratch with this interactive textbook. Aimed at Python developers and students, it provides a deep, line-by-line understanding of Transformer architectures, enabling users to construct and train their own GPT models, moving beyond superficial API usage to grasp core computational principles.

How It Works

This project offers a 12-chapter, ~3,600-line interactive guide in which users write every component of a GPT model themselves. Its pedagogical approach combines analogies simple enough for a five-year-old, worked numerical examples, and meticulously annotated code. The core implementation is a decoder-only Transformer that integrates widely used modern techniques: Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU, the AdamW optimizer, Byte Pair Encoding (BPE) tokenization, weight tying, and mixed-precision training. The aim is a comprehensive grasp of internal mechanics, in contrast to API-level tutorials or dense academic papers.
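As an illustration of one technique named above, the following is a minimal pure-Python sketch of RoPE, not the repository's code. The function name `rope_rotate`, the base of 10000, and the pairing of adjacent dimensions are assumptions chosen for clarity (some implementations pair the two halves of the vector instead). It shows the property that makes RoPE a relative encoding: the attention score between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by a position-dependent angle.

    Hypothetical helper for illustration; assumes len(vec) is even.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair i//2 rotates with frequency base^(-i/d): early pairs turn fast, late pairs slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up numbers).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

print(abs(score_near - score_far) < 1e-9)  # the two scores agree up to rounding
```

Because each pair is rotated by an angle proportional to position, the rotations applied to the query and the key cancel except for their difference, which is why shifting both positions by 100 leaves the score unchanged.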

Quick Start & Requirements

  1. Clone: git clone https://github.com/raiyanyahya/how-to-train-your-gpt.git
  2. Environment: python -m venv gpt_env && source gpt_env/bin/activate (or Windows equivalent).
  3. Install: pip install torch tiktoken datasets numpy matplotlib
  4. GPU Check (Optional): python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Prerequisites: Basic Python proficiency (variables, functions, classes). No prior ML, calculus, or linear algebra knowledge is assumed; these are taught in context. A GPU is highly recommended for training (roughly 2 hours on an RTX 3090); CPU-only execution is approximately 10-50x slower.
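The GPU check in step 4 can be extended into a small device picker. This is a sketch only: `pick_device` is a hypothetical helper, and the Apple `mps` fallback is an assumption that the README summary does not mention. It returns `"cpu"` when PyTorch is not installed, so it is safe to run before step 3.

```python
def pick_device():
    """Return the name of the fastest available PyTorch device as a string."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Assumption: Apple Silicon support via the MPS backend, if this torch build has it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```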

Highlighted Details

  • Implements a current, publicly documented decoder-only Transformer architecture in the style of models such as LLaMA 3, Mistral, and Qwen 2.5.
  • Builds core components including a BPE tokenizer (~60 lines), embeddings (~30 lines), RoPE (~70 lines), Multi-Head Attention (~120 lines), Transformer Block (~50 lines), a 124M parameter GPT model (~200 lines), a custom Training Pipeline (~250 lines), and an Inference Engine (~80 lines).
  • Explains advanced concepts such as the variance argument behind 1/√d_k, RoPE's relative position encoding via rotation, the stability benefits of pre-norm configurations, and detailed gradient flow during backpropagation.
  • Provides a runnable main.py script for end-to-end training and inference.
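The variance argument behind 1/√d_k mentioned above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the guide's own code: query and key entries are drawn independently from a standard normal distribution, and `score_variance` is a hypothetical helper. Under those assumptions the dot product of two d_k-dimensional vectors has variance close to d_k, and dividing by √d_k brings it back to about 1, which keeps the softmax inputs in a reasonable range.

```python
import math
import random
import statistics

def score_variance(d_k, scaled, trials=4000, seed=0):
    """Estimate the variance of q.k for random unit-variance q and k.

    Hypothetical helper; `trials` and `seed` are arbitrary illustration values.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        if scaled:
            s /= math.sqrt(d_k)  # the 1/sqrt(d_k) factor used in attention
        scores.append(s)
    return statistics.pvariance(scores)

raw = score_variance(64, scaled=False)
fixed = score_variance(64, scaled=True)
print(f"unscaled variance ~ {raw:.1f}, scaled variance ~ {fixed:.2f}")
```

With d_k = 64 the unscaled variance should come out near 64 and the scaled one near 1; the exact figures depend on the random seed.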

Maintenance & Community

The repository welcomes issues and pull requests. Specific community channels (e.g., Discord, Slack), roadmap details, or notable contributor/sponsorship information are not detailed in the README.

Licensing & Compatibility

The open-source license for this repository is not explicitly stated in the provided README. Potential users should verify licensing terms before integration into commercial or closed-source projects.

Limitations & Caveats

This project is primarily an educational tool for understanding LLM internals, not a production-ready framework. CPU-only training is roughly 10-50x slower than training on a GPU. The implemented architecture is based on publicly disclosed techniques; proprietary aspects of models like GPT-4 and Claude remain undisclosed.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 2

Star History

1,815 stars in the last 22 days
