Stanford-CS336  by YYZhang2025

Building Large Language Models from scratch

Created 10 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This repository provides solutions and detailed notes for Stanford's CS336 course, "Language Modeling from Scratch", covering both foundational and advanced topics. It is aimed at engineers and researchers who want to understand Large Language Models by implementing the core components themselves, and it serves as a resource for hands-on learning and experimentation.

How It Works

The project systematically implements key LLM building blocks. It begins with Byte Pair Encoding (BPE) for tokenization and progresses to a configurable Transformer language model featuring RMS Norm and Rotary Positional Embeddings (RoPE). Advanced sections explore Mixture of Experts (MoE) layers for enhanced model capacity, Triton-based Flash Attention for computational efficiency, and data parallelism for distributed training. The final assignments focus on LLM alignment techniques, including Supervised Fine-Tuning (SFT), Expert Iteration (EI), and Group Relative Policy Optimization (GRPO), applied to reasoning tasks.
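
As an illustration of the first building block, the following is a minimal sketch of byte-level BPE merge training. It is not the repository's implementation: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`), the pre-counted `word_counts` input, and the tie-breaking rule (prefer the lexicographically larger pair) are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if no pairs exist.

    `words` maps a tuple of byte-string symbols to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Assumed tie-break: highest count first, then lexicographically larger pair.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from pre-tokenized word counts."""
    # Start from single bytes so any UTF-8 input is representable.
    words = {}
    for word, count in word_counts.items():
        key = tuple(bytes([b]) for b in word.encode("utf-8"))
        words[key] = words.get(key, 0) + count
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


if __name__ == "__main__":
    counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(train_bpe(counts, 3))
```

Recounting every pair after each merge keeps the sketch short but is slow on real corpora; a practical tokenizer would update pair counts incrementally.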

Quick Start & Requirements

  • Primary install / run command: Uses uv for environment management (install with pip install uv or brew install uv). Install dependencies with uv sync and run code with uv run <python_file_path>.
  • Non-default prerequisites and dependencies: Python, uv, wget, huggingface_hub. Training performance benchmarks are based on a single NVIDIA H100 GPU. Flash Attention implementation leverages Triton. Alignment tasks use GSM8k and Math-12k datasets.
  • Estimated setup time or resource footprint: Data download and environment setup are straightforward. Training times vary significantly based on hardware (e.g., 34m28s on 1x H100 for a specific model config).
  • Links: No direct external links for quick-start guides or demos are provided.

Highlighted Details

  • BPE tokenizer training for TinyStories completed in ~85 seconds, with pre-tokenization taking ~30 minutes.
  • MoE models, particularly one with a reduced d_ff, outperformed a dense model of similar computational cost on the TinyStories dataset.
  • Flash Attention implementation yielded substantial speedups over standard attention, especially in forward passes, attributed to Triton optimization.
  • Alignment assignments demonstrate SFT, EI, and GRPO on reasoning datasets, evaluating zero-shot performance of Qwen2.5-Math-1.5B.
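
To make the GRPO item more concrete, the sketch below shows the group-relative advantage computation that gives the method its name: each sampled response to a prompt is scored against the mean and standard deviation of its own group's rewards. This is a minimal illustration, not the repository's code. The function name `group_relative_advantages`, the `eps` default, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical binary correctness rewards for four sampled answers.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all responses in a group receive the same reward, every advantage is zero, so that group contributes no learning signal.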

Maintenance & Community

The provided README does not contain information regarding maintenance status, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The README content does not specify the project's license or any compatibility notes for commercial use.

Limitations & Caveats

  • The data parallelism implementation was verified for correctness, but its performance could not be fully validated on the author's hardware: it ran slower than single-GPU training, which is attributed to overhead.
  • An MoE model with identical d_ff to the dense model showed no significant improvement, possibly due to overfitting on the small TinyStories dataset.
  • Solutions for Assignments 03 (Scaling Laws) and 04 (Data) are marked as unimplemented placeholders.
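
The summary does not say how data-parallel correctness was verified. One common check, sketched here under stated assumptions, is that averaging per-shard gradients reproduces the full-batch gradient. The scalar model, the helper names `mse_grad` and `data_parallel_grad`, and the equal-sized shards are all hypothetical and chosen only to keep the example dependency-free.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the scalar model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, world_size):
    """Simulate data parallelism: each rank computes a local gradient on its
    shard, then the gradients are summed (all-reduce) and divided by world size.
    """
    if len(xs) % world_size != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    shard = len(xs) // world_size
    local_grads = [
        mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    return sum(local_grads) / world_size


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0]
    print(mse_grad(0.5, xs, ys), data_parallel_grad(0.5, xs, ys, world_size=2))
```

The equality holds only when the loss is a mean over examples and shards are the same size; with uneven shards the per-rank gradients would need to be weighted by shard size.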

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 17 stars in the last 30 days
