Sophia by kyegomez

Optimizer for language model pre-training, claiming a 2x speedup over Adam

created 2 years ago
379 stars

Top 76.2% on sourcepulse

Project Summary

Sophia is a second-order optimizer designed to significantly reduce model training costs and accelerate convergence for large language models. It offers a faster alternative to Adam for researchers and practitioners aiming to cut computational expenses, claiming up to a 50% reduction in training time and compute.

How It Works

Sophia employs a scalable stochastic second-order optimization approach. It uses an inexpensive stochastic estimate of the Hessian's diagonal as a preconditioner, combined with a clipping mechanism to manage update magnitudes. This method aims to provide superior performance over Adam by achieving similar validation loss with fewer steps, less total compute, and reduced wall-clock time. The optimizer supports both Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
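
The mechanism can be sketched in a few lines of PyTorch. This is a minimal illustration of the update rule and the Hutchinson estimator described above, not the repository's actual code; the function names and hyperparameter defaults (lr, beta1, rho, eps) are assumptions chosen to match the paper's description.

```python
import torch

def hutchinson_diag_hessian(loss, params):
    # Hutchinson estimator: for Rademacher z (entries +/-1), E[z * (H z)] = diag(H).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # z in {-1, +1}
    gz = sum((g * z).sum() for g, z in zip(grads, zs))              # scalar g^T z
    hvps = torch.autograd.grad(gz, params)                          # Hessian-vector product H z
    return [z * hvp for z, hvp in zip(zs, hvps)]                    # z * (H z), elementwise

@torch.no_grad()
def sophia_update(param, grad, m, h, lr=2e-4, beta1=0.96, rho=0.04, eps=1e-12):
    # m: EMA of gradients (momentum); h: EMA of a diagonal-Hessian estimate,
    # refreshed only every k steps (refresh logic not shown here).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Preconditioned update with per-coordinate clipping:
    #   theta <- theta - lr * sign(m) * min(|m| / (rho * h), 1)
    ratio = (m.abs() / (rho * h).clamp(min=eps)).clamp(max=1.0)
    param.add_(m.sign() * ratio, alpha=-lr)
```

Because the preconditioned step is clipped coordinate-wise, each parameter moves at most lr per step even when the Hessian estimate is noisy or stale, which is what makes the cheap every-k-steps estimate safe to use.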

Quick Start & Requirements

Highlighted Details

  • Claims 50% fewer steps, 50% less total compute, and 50% less wall-clock time compared to Adam for equivalent validation pre-training loss.
  • "Plug-and-play" integration with existing PyTorch training pipelines.
  • Supports Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
  • Hyperparameter tuning guide suggests learning rates around half of AdamW's and a transferable rho value (e.g., 0.03-0.04).
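
A hedged integration sketch, assuming an interface similar to the reference implementation's SophiaG class; the import path, constructor signature, and Hessian-refresh hook may differ in this repository, so verify against its README:

```python
import torch
import torch.nn as nn
from Sophia import SophiaG  # assumed import path; check the repo's README

model = nn.Linear(10, 2)
# Tuning-guide heuristics from above: lr around half of AdamW's, rho ~ 0.03-0.04.
opt = SophiaG(model.parameters(), lr=1e-4, betas=(0.965, 0.99), rho=0.04)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
for step in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    # NOTE: Sophia also expects its diagonal-Hessian estimate to be refreshed
    # every k steps; the exact hook (e.g., an update_hessian() call) varies by
    # implementation, so consult the repo before relying on this sketch.
```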

Maintenance & Community

  • The project is developed by kyegomez; per the Health Check below, the last commit was about a year ago.
  • Roadmap includes performance improvements, additional Hessian estimators, hyperparameter tuning guides, integration with Andromeda, variants for specific tasks (CV, NLP, RL), distributed training support, and automatic hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. This requires clarification for commercial use or closed-source linking.

Limitations & Caveats

  • The README mentions that Sophia does not support sparse gradients.
  • The Gauss-Newton-Bartlett estimator requires access to all input data for Hessian calculation, which might be memory-intensive; a sketch follows this list.
  • Explicit license information is missing, posing a potential adoption blocker for commercial applications.
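
To make the second caveat concrete, below is a rough classification-style sketch of the Gauss-Newton-Bartlett estimator as described in the Sophia paper; model and x are placeholders, and the single forward pass over the entire batch is where the memory pressure comes from.

```python
import torch
import torch.nn.functional as F

def gnb_diag_hessian(model, x):
    # Gauss-Newton-Bartlett estimator: resample labels from the model's own
    # predictive distribution, then use B * g_hat^2 (elementwise) as an
    # estimate of the diagonal of the Gauss-Newton matrix.
    logits = model(x)  # forward pass over the whole batch x
    y_hat = torch.distributions.Categorical(logits=logits).sample()
    loss = F.cross_entropy(logits, y_hat)
    g_hat = torch.autograd.grad(loss, list(model.parameters()))
    return [x.shape[0] * g * g for g in g_hat]  # B * g_hat * g_hat
```
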
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days

Explore Similar Projects

Sophia by Liuhong99

  • 965 stars
  • Optimizer for language model pre-training (research paper)
  • created 2 years ago, updated 1 year ago
  • Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (cofounder of fast.ai), and 1 more.