train-llm-from-scratch by FareedKhan-dev

SDK for training LLMs from scratch using PyTorch

Created 8 months ago
431 stars

Top 68.9% on SourcePulse

View on GitHub
Project Summary

This repository provides a straightforward PyTorch implementation for training a Transformer-based Large Language Model (LLM) from scratch. It's designed for researchers and developers who want to understand and experiment with LLM training pipelines, from data preparation to text generation, potentially on consumer-grade GPUs.

How It Works

The project implements a standard Transformer architecture, including multi-head self-attention and feed-forward networks, following the "Attention Is All You Need" paper. It processes data from The Pile dataset, tokenizes it using tiktoken, and stores it in HDF5 format for efficient loading. Training utilizes batch processing and includes mechanisms for learning rate decay and model evaluation.
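As a rough illustration of the components described above, the sketch below shows a pre-norm decoder block combining causal multi-head self-attention with a feed-forward network in PyTorch. This is a minimal, generic example rather than the repository's exact implementation; names such as `d_model` and `n_heads` are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm decoder block: causal multi-head self-attention + feed-forward."""
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device, dtype=torch.bool), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffn(self.ln2(x))     # residual connection around the feed-forward network
        return x
```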

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch, tiktoken, h5py, tqdm, zstandard. A GPU is required for training; Colab/Kaggle T4 is sufficient for ~13M parameter models, but larger models require more VRAM.
  • Setup: Clone the repository, set PYTHONPATH, download data (scripts/data_download.py), preprocess data (scripts/data_preprocess.py), and train (scripts/train_transformer.py); a rough preprocessing sketch follows this list.
  • Docs: Step-by-step code explanation
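The preprocessing step referenced above can be pictured roughly as follows: decompress a Pile-style zstandard shard, tokenize each document with tiktoken, and write the token ids to HDF5 for efficient loading. This is a minimal sketch with hypothetical file paths, dataset name, and encoding choice; the repository's scripts/data_preprocess.py defines its own.

```python
import io
import json

import h5py
import numpy as np
import tiktoken
import zstandard as zstd

# Hypothetical paths; the repository's scripts use their own locations and names.
RAW_FILE = "data/00.jsonl.zst"   # Pile shards are zstandard-compressed JSON lines
OUTPUT_H5 = "data/train.h5"

enc = tiktoken.get_encoding("gpt2")  # BPE tokenizer; the repo may configure a different encoding

token_ids = []
with open(RAW_FILE, "rb") as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        text = json.loads(line)["text"]                        # assumes a Pile-style "text" field
        token_ids.extend(enc.encode_ordinary(text))
        token_ids.append(enc.eot_token)                        # separate documents with end-of-text

tokens = np.array(token_ids, dtype=np.uint16)                  # GPT-2 ids (< 50257) fit in uint16
with h5py.File(OUTPUT_H5, "w") as h5f:
    h5f.create_dataset("tokens", data=tokens)

print(f"Wrote {len(tokens):,} tokens to {OUTPUT_H5}")
```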

Highlighted Details

  • Implements a full Transformer architecture from scratch in PyTorch.
  • Supports training models from millions to billions of parameters, with detailed GPU VRAM requirements provided.
  • Uses The Pile dataset and tiktoken for data handling and tokenization.
  • Includes scripts for data download, preprocessing, training, and text generation (see the sampling sketch after this list).
  • Demonstrates training a ~2.1B parameter model.
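For the text-generation script mentioned above, a typical autoregressive sampling loop looks like the sketch below. It assumes a generic model interface that maps (batch, sequence) token ids to per-position logits and a GPT-2 tiktoken encoding; the repository's own generation script may differ in interface and sampling options.

```python
import torch
import tiktoken

@torch.no_grad()
def generate(model, prompt: str, max_new_tokens: int = 100,
             temperature: float = 0.8, top_k: int = 50, device: str = "cuda") -> str:
    """Sample tokens one at a time from a model that maps (B, T) ids to (B, T, vocab) logits."""
    enc = tiktoken.get_encoding("gpt2")
    ids = torch.tensor([enc.encode(prompt)], dtype=torch.long, device=device)
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature          # logits for the last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")      # keep only the top-k candidates
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample the next token id
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())
```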

Maintenance & Community

  • The project is maintained by FareedKhan-dev. Contributions are welcome.
  • Links to author's Resume and GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README indicates that while a ~13M parameter model trains reasonably well on consumer GPUs, training billion-parameter models requires significant hardware resources (e.g., NVIDIA A100 40GB+). The provided configuration for a billion-parameter model uses a context length of 512, which might be demanding. The project is presented as a learning tool, and advanced optimization techniques for very large models might not be fully explored.
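To get a feel for these hardware numbers, the sketch below estimates parameter counts for GPT-style configurations and the resulting AdamW training footprint. The configurations and the 16-bytes-per-parameter figure (fp32 weights, gradients, and two optimizer moments) are illustrative assumptions, not the repository's exact settings.

```python
def gpt_param_count(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Rough weight count for a GPT-style decoder (ignores biases, LayerNorm, positional embeddings)."""
    embedding = vocab_size * d_model              # token embedding, often tied with the output head
    attention = 4 * d_model * d_model             # Q, K, V, and output projections per layer
    feed_forward = 2 * d_model * (4 * d_model)    # up- and down-projections per layer
    return embedding + n_layers * (attention + feed_forward)

# Illustrative configurations, not the repository's exact ones.
small = gpt_param_count(vocab_size=50257, d_model=256, n_layers=4)
large = gpt_param_count(vocab_size=50257, d_model=2048, n_layers=32)
print(f"small: ~{small / 1e6:.0f}M parameters")
print(f"large: ~{large / 1e9:.2f}B parameters")

# With AdamW in fp32, weights + gradients + two optimizer moments cost roughly
# 16 bytes per parameter, before counting activations or the context-length-dependent KV cache:
print(f"large model training state: ~{large * 16 / 1e9:.0f} GB")
```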

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

LLM research codebase for training and inference

Top 0.1% on SourcePulse
5k stars
Created 11 months ago
Updated 2 months ago