llm-from-scratch by angelos-p

Build a GPT language model from scratch

Created 1 month ago
1,826 stars

Top 23.2% on SourcePulse

Project Summary

This repository offers a hands-on workshop for training a ~10M-parameter GPT model from scratch on a laptop. It targets engineers and researchers who want a deep, practical understanding of LLM components, demystifying transformer architectures and training pipelines through a simplified, runnable implementation inspired by nanoGPT. The primary benefit is that users can build and train a functional LLM within a single workshop session, building intuition for how the model's internals work.

How It Works

The project guides users through building a complete GPT training pipeline piece by piece. Core components include a character-level tokenizer, the transformer architecture (embeddings, self-attention, feed-forward networks), and a full training loop with optimization and learning-rate scheduling. This modular, from-scratch approach, scaled down for accessibility, lets users grasp the function and necessity of each element. The choice of character-level tokenization is deliberate: on a small corpus like the Shakespeare dataset, BPE tokens occur too infrequently for the model to learn useful patterns, whereas individual characters recur often enough to train on.
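As a rough illustration of the character-level approach (a minimal sketch that assumes the corpus is read from shakespeare.txt; the repository's actual tokenizer may differ), the tokenizer can be little more than a pair of lookup tables built from the unique characters in the training text:

```python
# Illustrative character-level tokenizer sketch (not the repository's exact code).
# The vocabulary is simply the set of unique characters in the corpus (~65 for Shakespeare).
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                      # unique characters define the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}   # integer id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))                              # vocabulary size, roughly 65 for Shakespeare
print(decode(encode("To be, or not to be")))   # round-trips back to the original string
```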

Quick Start & Requirements

Local setup involves installing uv (via the curl or PowerShell installer) and running uv sync. On Google Colab, run !pip install torch numpy tqdm tiktoken and upload shakespeare.txt. Prerequisites are Python 3.12+ and any laptop (Mac, Linux, or Windows). Training automatically uses an Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU, whichever is available. The default ~10M-parameter model trains in approximately 45 minutes on an M3 Pro.
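The automatic device selection described above typically reduces to a short PyTorch check along these lines (a sketch of the common pattern, not necessarily the repository's exact code):

```python
import torch

# Pick the fastest available backend: Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"training on {device}")
x = torch.randn(4, 8, device=device)  # the model and every batch must live on the same device
```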

Highlighted Details

  • Offers model configurations from ~0.5M (Tiny) to ~10M (Medium) parameters.
  • Training times range from ~5 minutes (Tiny) to ~45 minutes (Medium) on an M3 Pro.
  • Employs character-level tokenization (vocab size ~65) for small datasets, contrasting with BPE for larger corpora.
  • The architecture follows standard GPT blocks: embeddings, self-attention, MLP, residual connections, and LayerNorm (see the sketch after this list).
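To make the architecture bullet concrete, here is a hedged sketch of one pre-norm GPT-style block, using PyTorch's built-in nn.MultiheadAttention as a stand-in for a hand-rolled attention module. The hyperparameter names (n_embd, n_head, block_size) are illustrative; the repository's implementation may differ in details such as dropout and weight initialization.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT-style transformer block: pre-LayerNorm, causal self-attention, MLP, residuals."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T])
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.ln2(x))    # residual connection around the MLP
        return x

# Tiny smoke test: batch of 2 sequences, 16 tokens, 64-dim embeddings.
block = Block(n_embd=64, n_head=4, block_size=128)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

A full model stacks several such blocks between the token/position embeddings and a final LayerNorm plus linear head that projects back to the vocabulary.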

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or roadmaps are provided in the README.

Licensing & Compatibility

The README does not explicitly state the software license, leaving its terms, restrictions, and compatibility for commercial use or closed-source linking unclear.

Limitations & Caveats

The project's focus on character-level tokenization and smaller model sizes makes it unsuitable for training large-scale, general-purpose LLMs. The absence of a specified license is a significant adoption blocker.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 1,898 stars in the last 30 days
