Sophia by kyegomez

Optimizer for language model pre-training, claiming 2x speedup over Adam

Created 2 years ago
382 stars

Top 74.7% on SourcePulse

Project Summary

Sophia is a second-order optimizer designed to cut model training costs and accelerate convergence for large language models. It offers researchers and practitioners a faster alternative to Adam, claiming up to a 50% reduction in training time and compute.

How It Works

Sophia employs a scalable stochastic second-order optimization approach. It uses an inexpensive stochastic estimate of the Hessian's diagonal as a preconditioner, combined with a clipping mechanism to manage update magnitudes. This method aims to provide superior performance over Adam by achieving similar validation loss with fewer steps, less total compute, and reduced wall-clock time. The optimizer supports both Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
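As a rough illustration of that update, here is a minimal PyTorch sketch of a Hutchinson-style diagonal-Hessian estimate plus the clipped, preconditioned momentum step. The function names, defaults, and wiring are illustrative assumptions based on the description above, not the repo's actual API:

```python
import torch

def hutchinson_diag_hessian(loss, params):
    """Stochastic diagonal-Hessian estimate: u * (H @ u), u ~ Rademacher.
    Illustrative sketch; the repo may wire this differently."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # +/-1 entries
    gu = sum((g * u).sum() for g, u in zip(grads, us))
    hvps = torch.autograd.grad(gu, params)  # Hessian-vector products H @ u
    return [u * hvp for u, hvp in zip(us, hvps)]

@torch.no_grad()
def sophia_step(params, grads, ms, hs, lr=1e-4, beta1=0.96, rho=0.04, eps=1e-12):
    """One Sophia-style update: momentum divided by the (smoothed) diagonal
    Hessian estimate, clipped elementwise. `hs` holds the Hessian EMA."""
    for p, g, m, h in zip(params, grads, ms, hs):
        m.mul_(beta1).add_(g, alpha=1 - beta1)           # momentum EMA
        # preconditioned step, clipped per coordinate to [-1, 1]
        update = (m / (rho * h).clamp(min=eps)).clamp(-1.0, 1.0)
        p.add_(update, alpha=-lr)
```

In practice the Hessian estimate is refreshed only every k steps (k ≈ 10 in the paper) and smoothed with its own EMA, which keeps the average per-step overhead close to Adam's.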

Quick Start & Requirements
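This page does not reproduce the README's install or usage snippet. As a minimal integration sketch, assuming the package exposes a standard PyTorch-style optimizer class (the import path, class name, and constructor arguments below are assumptions, not the confirmed API):

```python
import torch
import torch.nn.functional as F
from sophia import Sophia  # assumed import path; check the repo's README

model = torch.nn.Linear(512, 512)
# hypothetical constructor mirroring common Sophia implementations
optimizer = Sophia(model.parameters(), lr=3e-4, betas=(0.96, 0.99), rho=0.04)

data = [(torch.randn(8, 512), torch.randint(0, 512, (8,))) for _ in range(10)]
for step, (x, y) in enumerate(data):
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Sophia-style optimizers typically refresh their diagonal-Hessian
    # estimate only every k steps, keeping per-step cost close to Adam's.
```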

Highlighted Details

  • Claims 50% fewer steps, 50% less total compute, and 50% less wall-clock time compared to Adam for equivalent validation pre-training loss.
  • "Plug-and-play" integration with existing PyTorch training pipelines.
  • Supports Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
  • Hyperparameter tuning guide suggests learning rates around half of AdamW's and a transferable rho value (e.g., 0.03-0.04); a rough mapping from an AdamW config is sketched below.
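
Assuming a constructor like the one in the quick-start sketch above, that tuning advice translates roughly to:

```python
adamw_lr = 6e-4                  # learning rate you would have used with AdamW
sophia_kwargs = dict(
    lr=adamw_lr / 2,             # guide: roughly half of AdamW's learning rate
    rho=0.04,                    # guide: rho transfers well in the 0.03-0.04 range
    betas=(0.96, 0.99),          # common Sophia defaults; confirm against the repo
    weight_decay=0.1,            # reuse your AdamW decay as a starting point
)
```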

Maintenance & Community

  • The project is developed by kyegomez; per the Health Check below, the last commit was about a year ago.
  • Roadmap includes performance improvements, additional Hessian estimators, hyperparameter tuning guides, integration with Andromeda, variants for specific tasks (CV, NLP, RL), distributed training support, and automatic hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. This requires clarification for commercial use or closed-source linking.

Limitations & Caveats

  • The README mentions that Sophia does not support sparse gradients.
  • The Gauss-Newton-Bartlett estimator needs a full forward/backward pass over the batch's input data to form its Hessian estimate, which can be memory-intensive (see the sketch after this list).
  • Explicit license information is missing, posing a potential adoption blocker for commercial applications.
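
To make the memory caveat concrete, here is a minimal sketch of a Gauss-Newton-Bartlett-style diagonal estimate for a classifier: labels are resampled from the model's own output distribution, and the squared mini-batch gradient serves as the Hessian estimate. Names and scaling are illustrative; the repo's implementation may differ:

```python
import torch
import torch.nn.functional as F

def gnb_diag_hessian(model, x):
    """Gauss-Newton-Bartlett diagonal estimate (sketch). Requires a full
    forward/backward over the batch's inputs, which is why this estimator
    is the memory-hungry option flagged above."""
    logits = model(x)
    # sample "fake" labels from the model's own predictive distribution
    with torch.no_grad():
        y_hat = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(-1)
    loss = F.cross_entropy(logits, y_hat)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    batch_size = x.shape[0]
    # diagonal of E[g g^T], scaled by batch size per the GNB construction
    return [batch_size * g * g for g in grads]
```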

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Vincent Weisser (cofounder of Prime Intellect), and 4 more.

Sophia by Liuhong99

Optimizer for language model pre-training (research paper)
Top 0.1% on SourcePulse
970 stars
Created 2 years ago
Updated 1 year ago
Starred by Benjamin Bolte (cofounder of K-Scale Labs), Albert Gu (cofounder of Cartesia; professor at CMU), and 2 more.

Muon by KellerJordan

Optimizer for neural network hidden layers
Top 1.7% on SourcePulse
2k stars
Created 10 months ago
Updated 2 months ago