SDK for training LLMs from scratch using PyTorch
This repository provides a straightforward PyTorch implementation for training a Transformer-based Large Language Model (LLM) from scratch. It's designed for researchers and developers who want to understand and experiment with LLM training pipelines, from data preparation to text generation, potentially on consumer-grade GPUs.
How It Works
The project implements a standard Transformer architecture, including multi-head self-attention and feed-forward networks, following the "Attention Is All You Need" paper. It processes data from The Pile dataset, tokenizes it using tiktoken, and stores it in HDF5 format for efficient loading. Training uses batch processing and includes mechanisms for learning rate decay and model evaluation.
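As a minimal sketch of the kind of block described above (not the repository's actual code; class names, default dimensions, and the GELU activation are illustrative assumptions), a single decoder block with masked multi-head self-attention and a feed-forward network might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative decoder block: masked multi-head self-attention + feed-forward."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ff: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True entries mark positions a token is NOT allowed to attend to.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        x = self.ln1(x + self.drop(attn_out))    # residual connection + layer norm
        x = self.ln2(x + self.drop(self.ff(x)))  # residual connection + layer norm
        return x

# Toy usage: batch of 4 sequences, 128 tokens each, embedding dimension 256.
block = TransformerBlock()
print(block(torch.randn(4, 128, 256)).shape)  # torch.Size([4, 128, 256])
```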
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt; key packages are tiktoken, h5py, tqdm, and zstandard. A GPU is required for training: a Colab/Kaggle T4 is sufficient for ~13M-parameter models, but larger models require more VRAM.
To run the pipeline, set PYTHONPATH, download data (scripts/data_download.py), preprocess data (scripts/data_preprocess.py), and train (scripts/train_transformer.py). A rough sketch of the preprocessing step is shown below.
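The actual preprocessing logic lives in scripts/data_preprocess.py; as a hedged sketch of the general idea only (the file name, dataset name, and chunking scheme here are assumptions, not the repository's choices), tokenizing text with tiktoken and appending the IDs to an HDF5 file via h5py could look like this:

```python
import h5py
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # assumption: GPT-2 BPE vocabulary

def append_tokens(h5_path: str, texts: list[str], dataset: str = "tokens") -> None:
    """Tokenize texts and append the IDs to a resizable 1-D HDF5 dataset."""
    ids = np.concatenate(
        [np.asarray(enc.encode(t) + [enc.eot_token], dtype=np.uint16) for t in texts]
    )
    with h5py.File(h5_path, "a") as f:
        if dataset not in f:
            f.create_dataset(dataset, shape=(0,), maxshape=(None,), dtype=np.uint16, chunks=True)
        d = f[dataset]
        start = d.shape[0]
        d.resize((start + ids.shape[0],))
        d[start:] = ids

# Example usage with a couple of toy documents.
append_tokens("pile_tokens.h5", ["Hello world.", "Attention is all you need."])
```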
Highlighted Details
Uses tiktoken for data handling and tokenization.
Maintenance & Community
The last recorded activity was 2 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The README indicates that while a ~13M parameter model trains reasonably well on consumer GPUs, training billion-parameter models requires significant hardware resources (e.g., NVIDIA A100 40GB+). The provided configuration for a billion-parameter model uses a context length of 512, which might be demanding. The project is presented as a learning tool, and advanced optimization techniques for very large models might not be fully explored.
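For intuition, a rough back-of-the-envelope estimate (assuming plain fp32 AdamW training and ignoring activation memory, which grows with batch size and the 512-token context) shows why the jump from ~13M to ~1B parameters changes the hardware requirements:

```python
def rough_training_memory_gb(n_params: float, bytes_per_value: int = 4) -> float:
    """Weights + gradients + two Adam moment buffers, all stored in fp32 (4 bytes each)."""
    return n_params * bytes_per_value * (1 + 1 + 2) / 1e9

print(f"~13M params: {rough_training_memory_gb(13e6):.2f} GB")  # ~0.21 GB, fits a T4 easily
print(f"~1B params:  {rough_training_memory_gb(1e9):.1f} GB")   # ~16 GB before activations
```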