angelos-p / Build a GPT language model from scratch
Top 23.2% on SourcePulse
This repository offers a hands-on workshop for training a ~10M-parameter GPT model from scratch on a laptop. It targets engineers and researchers who want a deep, practical understanding of LLM components, demystifying transformer architectures and training pipelines through a simplified, runnable implementation inspired by nanoGPT. The payoff is a functional LLM built and trained within a single workshop session, along with an intuitive grasp of model internals.
How It Works
The project guides users through building a complete GPT training pipeline piece by piece. Core components include a character-level tokenizer, the transformer architecture (embeddings, self-attention, feed-forward networks), and a full training loop with optimization and learning-rate scheduling. This modular, from-scratch approach, scaled down for accessibility, lets users grasp the function and necessity of each element. Character-level tokenization is a deliberate choice: on small corpora like the Shakespeare dataset, a BPE vocabulary leaves most tokens too infrequent for the model to learn meaningful patterns.
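The character-level approach can be sketched in a few lines of Python (a minimal illustration of the idea, not the repository's exact code):

```python
class CharTokenizer:
    """Minimal character-level tokenizer: one integer id per unique character."""

    def __init__(self, text: str):
        chars = sorted(set(text))                       # vocabulary = unique chars
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[c] for c in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("to be or not to be")
ids = tok.encode("to be")
print(tok.vocab_size, ids, tok.decode(ids))
```

Because the vocabulary is only the set of characters seen in the corpus (dozens of symbols rather than the ~50k entries of a typical BPE vocabulary), every token appears frequently enough for a small model to learn from a small dataset.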
Quick Start & Requirements
Local setup involves installing uv (via curl or PowerShell) and running uv sync. On Google Colab, run !pip install torch numpy tqdm tiktoken and upload shakespeare.txt. Prerequisites are Python 3.12+ and any laptop (Mac, Linux, or Windows). Training automatically uses an Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU. The default ~10M-parameter model trains in roughly 45 minutes on an M3 Pro.
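Automatic backend selection of this kind is typically a small helper in PyTorch; a hypothetical sketch (not the repository's exact code):

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple Silicon's MPS backend, falling back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


# The model and each training batch are then moved onto the chosen device:
# device = pick_device()
# model = model.to(device)
# x, y = x.to(device), y.to(device)
print(pick_device())
```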
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or roadmaps are provided in the README.
Licensing & Compatibility
The README does not explicitly state the software license, leaving its terms, restrictions, and compatibility for commercial use or closed-source linking unclear.
Limitations & Caveats
The project's focus on character-level tokenization and smaller model sizes makes it unsuitable for training large-scale, general-purpose LLMs. The absence of a specified license is a significant adoption blocker.