train-llm-from-scratch by FareedKhan-dev

SDK for training LLMs from scratch using PyTorch

created 6 months ago
411 stars

Top 72.2% on sourcepulse

View on GitHub
Project Summary

This repository provides a straightforward PyTorch implementation for training a Transformer-based Large Language Model (LLM) from scratch. It's designed for researchers and developers who want to understand and experiment with LLM training pipelines, from data preparation to text generation, potentially on consumer-grade GPUs.

How It Works

The project implements a standard Transformer architecture, including multi-head self-attention and feed-forward networks, following the "Attention Is All You Need" paper. It processes data from The Pile dataset, tokenizes it using tiktoken, and stores it in HDF5 format for efficient loading. Training utilizes batch processing and includes mechanisms for learning rate decay and model evaluation.
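
As a rough illustration of the architecture described above, the following minimal PyTorch sketch shows a decoder-style block with multi-head self-attention and a feed-forward network. Module names and hyperparameters are illustrative assumptions, not the repository's actual code.

    # Minimal sketch of a decoder-style Transformer block (illustrative only;
    # names and hyperparameters are assumptions, not the repo's actual API).
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model),
            )
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x):
            # Causal mask: each position may attend only to earlier positions.
            T = x.size(1)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
            x = x + self.drop(attn_out)          # residual connection around attention
            x = x + self.drop(self.ff(self.ln2(x)))  # residual around feed-forward
            return x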

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch, tiktoken, h5py, tqdm, zstandard. A GPU is required for training; Colab/Kaggle T4 is sufficient for ~13M parameter models, but larger models require more VRAM.
  • Setup: Clone the repository, set PYTHONPATH, download data (scripts/data_download.py), preprocess data (scripts/data_preprocess.py), and train (scripts/train_transformer.py); a data-loading sketch follows this list.
  • Docs: Step-by-step code explanation
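
To give a feel for the HDF5-based data pipeline, the sketch below draws random training batches from a preprocessed token file. The dataset name ("tokens") and flat token layout are assumptions and may not match what scripts/data_preprocess.py actually writes.

    # Illustrative sketch of drawing (input, target) batches from preprocessed
    # HDF5 token data. Dataset name and layout are assumptions.
    import h5py
    import torch

    def get_batch(h5_path, batch_size=32, context_len=512, dataset="tokens"):
        with h5py.File(h5_path, "r") as f:
            data = f[dataset]                       # 1-D array of token ids (assumed)
            n = data.shape[0] - context_len - 1
            starts = torch.randint(0, n, (batch_size,)).tolist()
            x = torch.stack([torch.as_tensor(data[i:i + context_len]) for i in starts])
            y = torch.stack([torch.as_tensor(data[i + 1:i + 1 + context_len]) for i in starts])
        return x.long(), y.long()                   # inputs and next-token targets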

Highlighted Details

  • Implements a full Transformer architecture from scratch in PyTorch.
  • Supports training models from millions to billions of parameters, with detailed GPU VRAM requirements provided.
  • Uses The Pile dataset and tiktoken for data handling and tokenization.
  • Includes scripts for data download, preprocessing, training, and text generation (see the generation sketch after this list).
  • Demonstrates training a ~2.1B parameter model.
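
For context on the text-generation step, here is a hedged sketch of a typical autoregressive sampling loop using tiktoken. The model interface (token ids in, next-token logits out) and the GPT-2 encoding are assumptions, not necessarily what the repository's script does.

    # Sketch of autoregressive sampling from a trained model. Assumes the model
    # maps a (1, T) tensor of token ids to (1, T, vocab) logits and lives on the
    # same device as the ids; the repo's actual generation script may differ.
    import torch
    import tiktoken

    @torch.no_grad()
    def generate(model, prompt, max_new_tokens=100, temperature=0.8, context_len=512):
        enc = tiktoken.get_encoding("gpt2")        # encoding name is an assumption
        ids = torch.tensor([enc.encode(prompt)])   # shape (1, T)
        model.eval()
        for _ in range(max_new_tokens):
            logits = model(ids[:, -context_len:])  # crop to the context window
            probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            ids = torch.cat([ids, next_id], dim=1)
        return enc.decode(ids[0].tolist())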

Maintenance & Community

  • The project is maintained by FareedKhan-dev. Contributions are welcome.
  • Links to author's Resume and GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README indicates that while a ~13M-parameter model trains reasonably well on consumer GPUs, training billion-parameter models requires significant hardware (e.g., an NVIDIA A100 40GB or better). The provided billion-parameter configuration uses a context length of 512, which can still be demanding at that scale. The project is presented as a learning tool, so advanced optimization techniques for very large models may not be fully explored.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

109 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Top 0.1% on sourcepulse
806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago
Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM

Top 0.3% on sourcepulse
1k stars
Transformer library for flexible model development
created 3 years ago
updated 7 months ago