Sophia by Liuhong99

Optimizer for language model pre-training (research paper)

Created 2 years ago
970 stars

Top 38.0% on SourcePulse

View on GitHub
Project Summary

This repository provides the official implementation of SophiaG, a scalable second-order optimizer designed for efficient language model pre-training. It aims to accelerate training by leveraging second-order information (Hessian approximations) while maintaining scalability, offering a faster alternative to first-order optimizers like AdamW and Lion.

How It Works

SophiaG maintains an exponential moving average of gradients (momentum) alongside a lightweight estimate of the diagonal of the Hessian, which is refreshed only every k steps (in SophiaG, via the Gauss-Newton-Bartlett estimator). Each update divides the momentum element-wise by this diagonal Hessian estimate, so the per-parameter step size adapts to local curvature. A key feature is element-wise clipping of the preconditioned update, controlled by the rho hyperparameter, which bounds the worst-case step along any coordinate and keeps training stable when the curvature estimate is noisy or stale.
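A minimal sketch of this update rule, written against plain PyTorch (variable names, default values, and the single-tensor framing are illustrative assumptions, not the repository's exact code):

```python
import torch

def sophia_style_update(param, grad, exp_avg, hess, *, lr=2e-4, beta1=0.965,
                        rho=0.01, bs=480, weight_decay=0.1, eps=1e-15):
    """Illustrative SophiaG-style step for a single parameter tensor.

    exp_avg is an EMA of gradients (momentum); hess is an EMA of a diagonal
    Hessian estimate, refreshed separately every k steps via update_hessian().
    """
    # Momentum: decaying average of past gradients.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Decoupled weight decay, as in AdamW.
    param.mul_(1 - lr * weight_decay)
    # Precondition the momentum by the diagonal Hessian, then clip each
    # coordinate of the update to [-1, 1]; rho (scaled by the batch size bs)
    # sets where clipping kicks in.
    ratio = (exp_avg.abs() / (rho * bs * hess + eps)).clamp(max=1.0)
    param.addcmul_(exp_avg.sign(), ratio, value=-lr)
```

When the curvature estimate for a coordinate is tiny or stale, the ratio saturates at 1 and the step falls back to a bounded sign-style update, which is what keeps large learning rates from causing divergence.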

Quick Start & Requirements

  • Install: pip install sophia-pytorch (or clone and install from source).
  • Prerequisites: PyTorch 2.1.2, transformers 4.33.0, datasets, tiktoken, wandb.
  • Usage: Import SophiaG and use it in place of other optimizers such as AdamW, calling optimizer.step(bs=...) every iteration and optimizer.update_hessian() every k steps (see the sketch after this list).
  • Docs: https://github.com/Liuhong99/Sophia
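
A minimal training-loop sketch following the usage pattern in the README. The GPT-style model API (model(X, Y) returning (logits, loss)), the example batch sizes, and k = 10 are assumptions; adapt them to your setup:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG

model = ...              # GPT-style model: model(X, Y) -> (logits, loss)
data_loader = ...        # yields (X, Y) token batches
batch_size, block_size = 8, 1024   # example values
bs = batch_size * block_size       # tokens per optimizer step
k = 10                             # refresh the Hessian estimate every k steps

optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.01, weight_decay=1e-1)

for iter_num, (X, Y) in enumerate(data_loader):
    # Standard forward/backward and parameter update.
    logits, loss = model(X, Y)
    loss.backward()
    optimizer.step(bs=bs)
    optimizer.zero_grad(set_to_none=True)

    # Every k steps, re-estimate the diagonal Hessian (Gauss-Newton-Bartlett):
    # sample labels from the model's own distribution, backprop that loss,
    # and let update_hessian() fold the result into the Hessian EMA.
    if iter_num % k == k - 1:
        logits, _ = model(X, None)
        y_sample = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                       y_sample.view(-1), ignore_index=-1)
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```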

Highlighted Details

  • Claims faster convergence than AdamW and Lion for language model pre-training; the paper reports reaching the same validation loss as AdamW in roughly half the number of steps on GPT-2-scale models.
  • Provides detailed hyperparameter tuning guidance and example configurations for reproducing GPT-2 results.
  • Supports distributed training via torchrun.
  • Requires periodic calls to update_hessian() (every k steps; the README example uses k = 10) to keep the diagonal Hessian estimate current.

Maintenance & Community

  • The project is associated with the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training".
  • Codebase is based on nanoGPT and levanter.
  • Encourages community feedback on hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing before commercial use.

Limitations & Caveats

  • Reproducing results for the 1.5B model requires TPU instances and specific setup via levanter.
  • Hyperparameter tuning, particularly rho and the learning rate, is crucial for performance and stability (see the diagnostic sketch after this list).
  • The update_hessian() call must be integrated into the training loop, which adds slight complexity compared to first-order optimizers.
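
One way to guide rho tuning (a hypothetical diagnostic, not an API provided by the repository) is to track what fraction of coordinates hit the clipping threshold: if rho is far too small nearly every coordinate clips and the optimizer degenerates into sign-style steps, while a very large rho rarely clips and the update is dominated by the raw preconditioned momentum. The state keys 'exp_avg' and 'hessian' below are assumptions about SophiaG's internal buffers:

```python
import torch

@torch.no_grad()
def clipped_fraction(optimizer, rho=0.01, bs=480, eps=1e-15):
    """Hypothetical diagnostic: fraction of coordinates whose update was clipped.

    Assumes the optimizer stores its momentum under 'exp_avg' and its diagonal
    Hessian estimate under 'hessian' in optimizer.state (matching the sketch above).
    """
    clipped, total = 0, 0
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg" not in state or "hessian" not in state:
                continue
            m, h = state["exp_avg"], state["hessian"]
            clipped += (m.abs() >= rho * bs * h + eps).sum().item()
            total += m.numel()
    return clipped / max(total, 1)
```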

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Victor Taelin (Author of Bend, Kind, HVM), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

nanoT5 by PiotrNawrot

0.2%
1k
PyTorch code for T5 pre-training and fine-tuning on a single GPU
Created 2 years ago
Updated 1 year ago
Starred by Benjamin Bolte (Cofounder of K-Scale Labs), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 2 more.

Muon by KellerJordan

1.7%
2k
Optimizer for neural network hidden layers
Created 10 months ago
Updated 2 months ago