Stanford-CS336 by YYZhang2025

Building Large Language Models from scratch

Created 10 months ago

251 stars

Top 99.8% on SourcePulse

Project Summary

This repository offers solutions and detailed notes for Stanford's CS336 "Language Modeling from Scratch" course, covering topics from tokenization to alignment. It is aimed at engineers and researchers who want to understand Large Language Models by implementing their core components, and it serves as a resource for hands-on learning and experimentation.

How It Works

The project systematically implements key LLM building blocks. It begins with Byte Pair Encoding (BPE) for tokenization and progresses to a configurable Transformer language model featuring RMS Norm and Rotary Positional Embeddings (RoPE). Advanced sections explore Mixture of Experts (MoE) layers for enhanced model capacity, Triton-based Flash Attention for computational efficiency, and data parallelism for distributed training. The final assignments focus on LLM alignment techniques, including Supervised Fine-Tuning (SFT), Expert Iteration (EI), and Group Relative Policy Optimization (GRPO), applied to reasoning tasks.
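As an illustration of the first building block, the following is a minimal sketch of byte-level BPE training, not the repository's implementation. The function names (`count_pairs`, `merge_pair`, `train_bpe`) and the tie-breaking rule are assumptions made for this example, and the input is assumed to be already pre-tokenized into word counts.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping."""
    # Start from single bytes so that any input string can be represented.
    words = {}
    for word, count in word_counts.items():
        key = tuple(bytes([b]) for b in word.encode("utf-8"))
        words[key] = words.get(key, 0) + count

    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the lexicographically
        # larger pair (an assumption chosen for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
```

Recounting every pair after each merge keeps the sketch short. A production tokenizer would typically update the pair counts incrementally instead.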

Quick Start & Requirements

Primary install / run command: Install uv for environment management (pip install uv or brew install uv), sync dependencies with uv sync, and run code with uv run <python_file_path>.
Non-default prerequisites and dependencies: Python, uv, wget, huggingface_hub. Reported training benchmarks were run on a single NVIDIA H100 GPU. The Flash Attention implementation uses Triton. Alignment tasks use the GSM8k and Math-12k datasets.
Estimated setup time or resource footprint: Data download and environment setup are straightforward. Training time depends heavily on hardware (e.g., 34m28s on 1x H100 for one model configuration).
Links: No direct external links for quick-start guides or demos are provided.

Highlighted Details

BPE tokenizer training for TinyStories completed in ~85 seconds, with pre-tokenization taking ~30 minutes.
MoE models, particularly one with a reduced d_ff, outperformed a dense model of similar computational cost on the TinyStories dataset.
Flash Attention implementation yielded substantial speedups over standard attention, especially in forward passes, attributed to Triton optimization.
Alignment assignments demonstrate SFT, EI, and GRPO on reasoning datasets, evaluating zero-shot performance of Qwen2.5-Math-1.5B.
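To make the "group relative" part of GRPO concrete, here is a minimal sketch of how per-response advantages can be computed from the rewards of several responses sampled for the same prompt. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions for this example. The repository's own normalization may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses in its group.

    `rewards` holds one scalar reward per sampled response to a single prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, without a learned value function. A group where all responses score the same yields zero advantages and so contributes no gradient signal.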

Maintenance & Community

The provided README does not contain information regarding maintenance status, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The README content does not specify the project's license or any compatibility notes for commercial use.

Limitations & Caveats

The data parallelism implementation was verified for correctness, but on the author's hardware it ran slower than single-GPU training due to overhead, so its performance benefit was not fully validated.
An MoE model with identical d_ff to the dense model showed no significant improvement, possibly due to overfitting on the small TinyStories dataset.
Solutions for Assignments 03 (Scaling Laws) and 04 (Data) are marked as unimplemented placeholders.
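The MoE comparisons above depend on what "similar computational cost" means. The sketch below shows one way to estimate per-token feed-forward FLOPs for a dense layer and a top-k MoE layer. All names and sizes here (`d_model=512`, `d_ff=1344`, 8 experts, top-2 routing, three projection matrices per feed-forward block) are hypothetical and are not taken from the repository.

```python
def ffn_flops_per_token(d_model, d_ff, num_matrices=3):
    """Approximate matmul FLOPs per token for one feed-forward block.

    Each d_model x d_ff projection costs about 2 * d_model * d_ff FLOPs.
    num_matrices=3 assumes a gated (SwiGLU-style) block.
    """
    return 2 * d_model * d_ff * num_matrices


def moe_flops_per_token(d_model, d_ff_expert, num_experts, top_k, num_matrices=3):
    """Approximate FLOPs per token when only `top_k` experts are active."""
    router = 2 * d_model * num_experts  # linear router producing expert scores
    active = top_k * ffn_flops_per_token(d_model, d_ff_expert, num_matrices)
    return active + router


# Hypothetical sizes: halving d_ff per expert with top-2 routing roughly
# matches the dense layer's compute while holding more total parameters.
dense = ffn_flops_per_token(d_model=512, d_ff=1344)
moe = moe_flops_per_token(d_model=512, d_ff_expert=672, num_experts=8, top_k=2)
print(dense, moe, moe - dense)
```

Under these assumptions, an MoE layer whose experts keep the dense model's full d_ff would cost roughly top_k times as much per token, which is why a reduced d_ff is the natural compute-matched comparison.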

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Pull Requests (30d)

0

Issues (30d)

0

Star History

17 stars in the last 30 days
