Optimizer for language model pre-training (research paper)
This repository provides the official implementation of SophiaG, a scalable second-order optimizer designed for efficient language model pre-training. It aims to accelerate training by leveraging second-order information (Hessian approximations) while maintaining scalability, offering a faster alternative to first-order optimizers like AdamW and Lion.
How It Works
SophiaG keeps an exponential moving average of a lightweight diagonal Hessian estimate (built from squared gradients, similar in spirit to Adam's second moment) alongside the usual gradient momentum. Updates are preconditioned by this curvature estimate, effectively adapting the step size per parameter. A key feature is element-wise clipping of the preconditioned update, controlled by a rho hyperparameter, which helps stabilize training and prevent divergence when the curvature estimate is unreliable or the learning rate is large.
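As a rough illustration, the per-coordinate update can be sketched as follows. This is a minimal sketch of a clipped, Hessian-preconditioned step, not the repository's exact code; the lr, rho, bs, and eps values are placeholders.

```python
import torch

def sophia_style_update(param, exp_avg, hessian, lr=2e-4, rho=0.05, bs=480, eps=1e-15):
    """Illustrative Sophia-style step (a sketch, not the official implementation).

    exp_avg : EMA of gradients (momentum)
    hessian : EMA of a diagonal Hessian estimate
    The preconditioned magnitude |m| / (rho * bs * h) is clipped at 1 element-wise,
    so no coordinate moves by more than lr in a single step.
    """
    ratio = (exp_avg.abs() / (rho * bs * hessian + eps)).clamp(max=1.0)
    param.data.add_(exp_avg.sign() * ratio, alpha=-lr)
    return param
```

Because the clip caps each coordinate's step at lr, a near-zero or stale curvature estimate cannot blow up the update, which is what makes relatively aggressive learning rates tolerable.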
Quick Start & Requirements
Install with pip install sophia-pytorch (or clone and install from source). Import SophiaG and use it in place of other optimizers, calling optimizer.step(bs=...) and optimizer.update_hessian() periodically.
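A minimal training loop might look like the sketch below. It is an illustration of the usage pattern just described, not the repository's reference code: the import path, the hyperparameter values, the update cadence k, and the label-resampling recipe used before update_hessian() are assumptions to check against the project's own examples.

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG  # module name may differ from the pip package name

# Toy stand-ins for a real language model and dataset
model = torch.nn.Linear(128, 10)
train_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(100)]

optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.05, weight_decay=0.1)  # illustrative hyperparameters

k = 10  # refresh the Hessian estimate every k optimizer steps (assumed cadence)
for step, (x, y) in enumerate(train_loader):
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step(bs=x.size(0))           # bs scales the clipping denominator
    optimizer.zero_grad(set_to_none=True)

    if step % k == k - 1:
        # Periodically refresh the diagonal Hessian estimate before update_hessian().
        # One common recipe backpropagates the loss on labels sampled from the
        # model's own predictions (Gauss-Newton-Bartlett style).
        logits = model(x)
        sampled_y = torch.distributions.Categorical(logits=logits).sample()
        F.cross_entropy(logits, sampled_y).backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```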
Highlighted Details
The provided training scripts are launched with torchrun, and the optimizer requires periodic update_hessian() calls to maintain its Hessian approximation.
Maintenance & Community
The repository was last updated about 1 year ago and is currently marked inactive.
Licensing & Compatibility
A JAX implementation of Sophia is available separately in the levanter project.
Limitations & Caveats
Hyperparameter tuning, particularly of rho and the learning rate, is crucial for optimal performance and stability. The update_hessian() call must be integrated into the training loop, adding slight complexity compared to first-order optimizers.
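Because rho controls the element-wise clip, one hedged way to guide its tuning is to monitor how often the clip is actually hit. The helper below is an illustrative assumption, not part of the repository's API, and any thresholds you act on would need to be validated empirically.

```python
import torch

def clipped_fraction(exp_avg, hessian, rho=0.05, bs=480, eps=1e-15):
    """Fraction of coordinates whose Sophia-style update hits the clip.

    ratio = |m| / (rho * bs * h); the update is clipped wherever ratio >= 1.
    If nearly every coordinate is clipped, the optimizer degenerates toward
    sign-SGD (rho * bs may be too small); if almost none are, the clip is
    inactive and rho may be larger than needed.
    """
    ratio = exp_avg.abs() / (rho * bs * hessian + eps)
    return (ratio >= 1.0).float().mean().item()
```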