Sophia by kyegomez

Optimizer for language model pre-training, claiming 2x speedup over Adam

Created 2 years ago
382 stars

Top 74.7% on SourcePulse

Project Summary

Sophia is a second-order optimizer designed to cut model training costs and accelerate convergence for large language models. It offers researchers and practitioners a faster alternative to Adam, claiming up to a 50% reduction in training time and compute.

How It Works

Sophia employs a scalable stochastic second-order optimization approach. It uses an inexpensive stochastic estimate of the Hessian's diagonal as a preconditioner, combined with a clipping mechanism to manage update magnitudes. This method aims to provide superior performance over Adam by achieving similar validation loss with fewer steps, less total compute, and reduced wall-clock time. The optimizer supports both Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
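As a rough illustration of that update, here is a minimal PyTorch sketch of a Hutchinson-style diagonal-Hessian estimate plus the clipped, preconditioned momentum step. The function names, defaults, and wiring are illustrative assumptions based on the description above, not the repo's actual API:

```python
import torch

def hutchinson_diag_hessian(loss, params):
    """Stochastic diagonal-Hessian estimate: u * (H @ u), u ~ Rademacher.
    Illustrative sketch; the repo may wire this differently."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    us = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # +/-1 entries
    gu = sum((g * u).sum() for g, u in zip(grads, us))
    hvps = torch.autograd.grad(gu, params)  # Hessian-vector products H @ u
    return [u * hvp for u, hvp in zip(us, hvps)]

@torch.no_grad()
def sophia_step(params, grads, ms, hs, lr=1e-4, beta1=0.96, rho=0.04, eps=1e-12):
    """One Sophia-style update: momentum divided by the (smoothed) diagonal
    Hessian estimate, clipped elementwise. `hs` holds the Hessian EMA."""
    for p, g, m, h in zip(params, grads, ms, hs):
        m.mul_(beta1).add_(g, alpha=1 - beta1)           # momentum EMA
        # preconditioned step, clipped per coordinate to [-1, 1]
        update = (m / (rho * h).clamp(min=eps)).clamp(-1.0, 1.0)
        p.add_(update, alpha=-lr)
```

In practice the Hessian estimate is refreshed only every k steps (k ≈ 10 in the paper) and smoothed with its own EMA, which keeps the average per-step overhead close to Adam's.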

Quick Start & Requirements
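This page does not reproduce the README's install or usage snippet. As a minimal integration sketch, assuming the package exposes a standard PyTorch-style optimizer class (the import path, class name, and constructor arguments below are assumptions, not the confirmed API):

```python
import torch
import torch.nn.functional as F
from sophia import Sophia  # assumed import path; check the repo's README

model = torch.nn.Linear(512, 512)
# hypothetical constructor mirroring common Sophia implementations
optimizer = Sophia(model.parameters(), lr=3e-4, betas=(0.96, 0.99), rho=0.04)

data = [(torch.randn(8, 512), torch.randint(0, 512, (8,))) for _ in range(10)]
for step, (x, y) in enumerate(data):
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Sophia-style optimizers typically refresh their diagonal-Hessian
    # estimate only every k steps, keeping per-step cost close to Adam's.
```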

Highlighted Details

  • Claims 50% fewer steps, 50% less total compute, and 50% less wall-clock time compared to Adam for equivalent validation pre-training loss.
  • "Plug-and-play" integration with existing PyTorch training pipelines.
  • Supports Hutchinson and Gauss-Newton-Bartlett Hessian estimators.
  • Hyperparameter tuning guide suggests learning rates around half of AdamW's and a transferable rho value (e.g., 0.03-0.04); a rough mapping from an AdamW config is sketched below.
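
Assuming a constructor like the one in the quick-start sketch above, that tuning advice translates roughly to:

```python
adamw_lr = 6e-4                  # learning rate you would have used with AdamW
sophia_kwargs = dict(
    lr=adamw_lr / 2,             # guide: roughly half of AdamW's learning rate
    rho=0.04,                    # guide: rho transfers well in the 0.03-0.04 range
    betas=(0.96, 0.99),          # common Sophia defaults; confirm against the repo
    weight_decay=0.1,            # reuse your AdamW decay as a starting point
)
```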

Maintenance & Community

  • The project is developed by kyegomez; per the Health Check below, the last commit was about a year ago.
  • Roadmap includes performance improvements, additional Hessian estimators, hyperparameter tuning guides, integration with Andromeda, variants for specific tasks (CV, NLP, RL), distributed training support, and automatic hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. This requires clarification for commercial use or closed-source linking.

Limitations & Caveats

  • The README mentions that Sophia does not support sparse gradients.
  • The Gauss-Newton-Bartlett estimator needs a full forward/backward pass over the batch's input data to form its Hessian estimate, which can be memory-intensive (see the sketch after this list).
  • Explicit license information is missing, posing a potential adoption blocker for commercial applications.
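
To make the memory caveat concrete, here is a minimal sketch of a Gauss-Newton-Bartlett-style diagonal estimate for a classifier: labels are resampled from the model's own output distribution, and the squared mini-batch gradient serves as the Hessian estimate. Names and scaling are illustrative; the repo's implementation may differ:

```python
import torch
import torch.nn.functional as F

def gnb_diag_hessian(model, x):
    """Gauss-Newton-Bartlett diagonal estimate (sketch). Requires a full
    forward/backward over the batch's inputs, which is why this estimator
    is the memory-hungry option flagged above."""
    logits = model(x)
    # sample "fake" labels from the model's own predictive distribution
    with torch.no_grad():
        y_hat = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(-1)
    loss = F.cross_entropy(logits, y_hat)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    batch_size = x.shape[0]
    # diagonal of E[g g^T], scaled by batch size per the GNB construction
    return [batch_size * g * g for g in grads]
```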

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Vincent Weisser (cofounder of Prime Intellect), and 4 more.

Sophia by Liuhong99

Optimizer for language model pre-training (research paper)
Top 0.1% on SourcePulse
970 stars
Created 2 years ago
Updated 1 year ago
Starred by Benjamin Bolte (cofounder of K-Scale Labs), Albert Gu (cofounder of Cartesia; professor at CMU), and 2 more.

Muon by KellerJordan

Optimizer for neural network hidden layers
Top 1.7% on SourcePulse
2k stars
Created 10 months ago
Updated 2 months ago