SDK for training LLMs from scratch using PyTorch
This repository provides a straightforward PyTorch implementation for training a Transformer-based Large Language Model (LLM) from scratch. It's designed for researchers and developers who want to understand and experiment with LLM training pipelines, from data preparation to text generation, potentially on consumer-grade GPUs.
How It Works
The project implements a standard Transformer architecture, including multi-head self-attention and feed-forward networks, following the "Attention Is All You Need" paper. It processes data from The Pile dataset, tokenizes it using tiktoken, and stores it in HDF5 format for efficient loading. Training uses batch processing and includes mechanisms for learning rate decay and model evaluation.
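As a minimal sketch of the kind of block described above (not the repository's actual code; class names, default dimensions, and the GELU activation are illustrative assumptions), a single decoder block with masked multi-head self-attention and a feed-forward network might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative decoder block: masked multi-head self-attention + feed-forward."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True entries mark positions a token is NOT allowed to attend to.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        x = self.ln1(x + self.drop(attn_out))    # residual connection + layer norm
        x = self.ln2(x + self.drop(self.ff(x)))  # residual connection + layer norm
        return x

# Toy usage: batch of 4 sequences, 128 tokens each, embedding dimension 256.
block = TransformerBlock()
print(block(torch.randn(4, 128, 256)).shape)  # torch.Size([4, 128, 256])
```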
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt; key packages are tiktoken, h5py, tqdm, and zstandard. A GPU is required for training: a Colab/Kaggle T4 is sufficient for ~13M-parameter models, but larger models require more VRAM.
To run the pipeline, set PYTHONPATH, download data (scripts/data_download.py), preprocess data (scripts/data_preprocess.py), and train (scripts/train_transformer.py). A rough sketch of the preprocessing step is shown below.
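The actual preprocessing logic lives in scripts/data_preprocess.py; as a hedged sketch of the general idea only (the file name, dataset name, and chunking scheme here are assumptions, not the repository's choices), tokenizing text with tiktoken and appending the IDs to an HDF5 file via h5py could look like this:

```python
import h5py
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # assumption: GPT-2 BPE vocabulary

def append_tokens(h5_path: str, texts: list[str], dataset: str = "tokens") -> None:
    """Tokenize texts and append the IDs to a resizable 1-D HDF5 dataset."""
    ids = np.concatenate(
        [np.asarray(enc.encode(t) + [enc.eot_token], dtype=np.uint16) for t in texts]
    )
    with h5py.File(h5_path, "a") as f:
        if dataset not in f:
            f.create_dataset(dataset, shape=(0,), maxshape=(None,), dtype=np.uint16, chunks=True)
        d = f[dataset]
        start = d.shape[0]
        d.resize((start + ids.shape[0],))
        d[start:] = ids

# Example usage with a couple of toy documents.
append_tokens("pile_tokens.h5", ["Hello world.", "Attention is all you need."])
```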
Highlighted Details
Uses tiktoken for data handling and tokenization.
Maintenance & Community
The last recorded activity was 2 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The README indicates that while a ~13M parameter model trains reasonably well on consumer GPUs, training billion-parameter models requires significant hardware resources (e.g., NVIDIA A100 40GB+). The provided configuration for a billion-parameter model uses a context length of 512, which might be demanding. The project is presented as a learning tool, and advanced optimization techniques for very large models might not be fully explored.
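For intuition, a rough back-of-the-envelope estimate (assuming plain fp32 AdamW training and ignoring activation memory, which grows with batch size and the 512-token context) shows why the jump from ~13M to ~1B parameters changes the hardware requirements:

```python
def rough_training_memory_gb(n_params: float, bytes_per_value: int = 4) -> float:
    """Weights + gradients + two Adam moment buffers, all stored in fp32 (4 bytes each)."""
    return n_params * bytes_per_value * (1 + 1 + 2) / 1e9

print(f"~13M params: {rough_training_memory_gb(13e6):.2f} GB")  # ~0.21 GB, fits a T4 easily
print(f"~1B params:  {rough_training_memory_gb(1e9):.1f} GB")   # ~16 GB before activations
```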