PyTorch code for T5 pre-training and fine-tuning on a single GPU
nanoT5 provides a PyTorch-based framework for pre-training and fine-tuning T5-style encoder-decoder language models under limited computational resources. It targets researchers and practitioners who need an accessible template for custom T5 model development, aiming to democratize LLM pre-training by showing that a T5 model can be pre-trained on a single A100 GPU in under 24 hours.
How It Works
The project optimizes the T5 training pipeline, leveraging HuggingFace Accelerate for distributed training primitives, Neptune.ai for experiment tracking, and Hydra for hyperparameter management. A key innovation is an augmented AdamW optimizer with RMS scaling, which stabilizes training and improves performance over the original Adafactor optimizer, especially when paired with a cosine learning-rate scheduler. The framework also preprocesses the C4 dataset on the fly and supports mixed-precision training and PyTorch 2.0 compilation for efficiency.
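To make the optimizer idea concrete, the sketch below shows one way to add RMS scaling to AdamW: the per-tensor step size is multiplied by the root-mean-square of that parameter tensor, in the spirit of Adafactor's relative step size. This is a minimal illustration, not nanoT5's actual implementation; the class name, default hyperparameters, and the eps_rms floor are assumptions.

import torch
from torch.optim import Optimizer

def rms(t):
    # Root-mean-square of a tensor: ||t||_2 / sqrt(numel).
    return t.norm(2) / (t.numel() ** 0.5)

class AdamWWithRMSScaling(Optimizer):
    # Illustrative sketch: AdamW whose per-tensor step size is multiplied by
    # max(eps_rms, RMS(param)), mimicking Adafactor's relative step size.
    def __init__(self, params, lr=2e-2, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0, eps_rms=1e-3):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, eps_rms=eps_rms)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                # Standard Adam first/second moment updates.
                exp_avg.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                bias1 = 1 - beta1 ** state["step"]
                bias2 = 1 - beta2 ** state["step"]
                denom = (exp_avg_sq / bias2).sqrt().add_(group["eps"])
                # The only change vs. plain AdamW: scale the step size by the
                # parameter tensor's RMS, floored at eps_rms.
                step_size = (group["lr"] / bias1) * max(group["eps_rms"], rms(p).item())
                if group["weight_decay"] != 0.0:
                    # Decoupled weight decay, as in AdamW.
                    p.mul_(1 - group["lr"] * group["weight_decay"])
                p.addcdiv_(exp_avg, denom, value=-step_size)

In nanoT5 this optimizer is paired with a cosine learning-rate schedule; a warmup-plus-cosine schedule such as torch.optim.lr_scheduler.CosineAnnealingLR would play that role in the sketch above.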
Quick Start & Requirements
Clone the repository and install dependencies with pip install -r requirements.txt. PyTorch 2.0 is needed to take advantage of torch.compile. A single A100 GPU is recommended for achieving the reported <24 hour pre-training times.
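For orientation, the sketch below shows how the pieces described above (Accelerate, mixed precision, torch.compile, a cosine schedule) might be wired together for a single training step. It is a simplified stand-in for the project's Hydra-driven entry point; the BF16 choice, learning rate, schedule length, and batch format are assumptions.

import torch
from accelerate import Accelerator
from transformers import T5Config, T5ForConditionalGeneration

accelerator = Accelerator(mixed_precision="bf16")  # assumed BF16, since FP16 runs diverged
model = T5ForConditionalGeneration(T5Config())     # placeholder config, not the project's exact model
model = torch.compile(model)                       # PyTorch 2.0 compilation for throughput
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-2)  # stand-in for the RMS-scaled variant above
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

def training_step(batch):
    # batch: dict with input_ids, attention_mask, labels produced by an
    # on-the-fly C4 span-corruption pipeline (the dataloader would also be
    # passed through accelerator.prepare in a full script).
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return outputs.loss.detach()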
Highlighted Details
Pre-training completes in under 24 hours on a single A100 GPU, with speedups from mixed-precision training and PyTorch 2.0's torch.compile.
Maintenance & Community
The project is maintained by Piotr Nawrot. Community interaction is encouraged via GitHub Issues.
Licensing & Compatibility
The repository is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The reported performance is achieved on an A100 GPU; performance on other hardware may vary. While the project aims for simplicity, advanced parallelism techniques like tensor or pipeline parallelism are not implemented, as they are deemed unnecessary for small-scale training and add significant complexity. FP16 precision experiments diverged, limiting precision options.