Sophia by Liuhong99

Optimizer for language model pre-training (research paper)

Created 2 years ago
970 stars

Top 38.0% on SourcePulse

View on GitHub
Project Summary

This repository provides the official implementation of SophiaG, a scalable second-order optimizer designed for efficient language model pre-training. It aims to accelerate training by leveraging second-order information (Hessian approximations) while maintaining scalability, offering a faster alternative to first-order optimizers like AdamW and Lion.

How It Works

SophiaG maintains an exponential moving average of gradients (momentum) alongside a lightweight estimate of the diagonal of the Hessian, which is refreshed only every k steps (in SophiaG, via the Gauss-Newton-Bartlett estimator). Each update divides the momentum element-wise by this diagonal Hessian estimate, so the per-parameter step size adapts to local curvature. A key feature is element-wise clipping of the preconditioned update, controlled by the rho hyperparameter, which bounds the worst-case step along any coordinate and keeps training stable when the curvature estimate is noisy or stale.
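A minimal sketch of this update rule, written against plain PyTorch (variable names, default values, and the single-tensor framing are illustrative assumptions, not the repository's exact code):

```python
import torch

def sophia_style_update(param, grad, exp_avg, hess, *, lr=2e-4, beta1=0.965,
                        rho=0.01, bs=480, weight_decay=0.1, eps=1e-15):
    """Illustrative SophiaG-style step for a single parameter tensor.

    exp_avg is an EMA of gradients (momentum); hess is an EMA of a diagonal
    Hessian estimate, refreshed separately every k steps via update_hessian().
    """
    # Momentum: decaying average of past gradients.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Decoupled weight decay, as in AdamW.
    param.mul_(1 - lr * weight_decay)
    # Precondition the momentum by the diagonal Hessian, then clip each
    # coordinate of the update to [-1, 1]; rho (scaled by the batch size bs)
    # sets where clipping kicks in.
    ratio = (exp_avg.abs() / (rho * bs * hess + eps)).clamp(max=1.0)
    param.addcmul_(exp_avg.sign(), ratio, value=-lr)
```

When the curvature estimate for a coordinate is tiny or stale, the ratio saturates at 1 and the step falls back to a bounded sign-style update, which is what keeps large learning rates from causing divergence.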

Quick Start & Requirements

  • Install: pip install sophia-pytorch (or clone and install from source).
  • Prerequisites: PyTorch 2.1.2, transformers 4.33.0, datasets, tiktoken, wandb.
  • Usage: Import SophiaG and use it in place of other optimizers such as AdamW, calling optimizer.step(bs=...) every iteration and optimizer.update_hessian() every k steps (see the sketch after this list).
  • Docs: https://github.com/Liuhong99/Sophia
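
A minimal training-loop sketch following the usage pattern in the README. The GPT-style model API (model(X, Y) returning (logits, loss)), the example batch sizes, and k = 10 are assumptions; adapt them to your setup:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG

model = ...              # GPT-style model: model(X, Y) -> (logits, loss)
data_loader = ...        # yields (X, Y) token batches
batch_size, block_size = 8, 1024   # example values
bs = batch_size * block_size       # tokens per optimizer step
k = 10                             # refresh the Hessian estimate every k steps

optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.01, weight_decay=1e-1)

for iter_num, (X, Y) in enumerate(data_loader):
    # Standard forward/backward and parameter update.
    logits, loss = model(X, Y)
    loss.backward()
    optimizer.step(bs=bs)
    optimizer.zero_grad(set_to_none=True)

    # Every k steps, re-estimate the diagonal Hessian (Gauss-Newton-Bartlett):
    # sample labels from the model's own distribution, backprop that loss,
    # and let update_hessian() fold the result into the Hessian EMA.
    if iter_num % k == k - 1:
        logits, _ = model(X, None)
        y_sample = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                       y_sample.view(-1), ignore_index=-1)
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```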

Highlighted Details

  • Claims faster convergence than AdamW and Lion for language model pre-training; the paper reports reaching the same validation loss as AdamW in roughly half the number of steps on GPT-2-scale models.
  • Provides detailed hyperparameter tuning guidance and example configurations for reproducing GPT-2 results.
  • Supports distributed training via torchrun.
  • Requires periodic calls to update_hessian() (every k steps; the README example uses k = 10) to keep the diagonal Hessian estimate current.

Maintenance & Community

  • The project is associated with the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training".
  • Codebase is based on nanoGPT and levanter.
  • Encourages community feedback on hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing before commercial use.

Limitations & Caveats

  • Reproducing results for the 1.5B model requires TPU instances and specific setup via levanter.
  • Hyperparameter tuning, particularly rho and the learning rate, is crucial for performance and stability (see the diagnostic sketch after this list).
  • The update_hessian() call must be integrated into the training loop, which adds slight complexity compared to first-order optimizers.
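
One way to guide rho tuning (a hypothetical diagnostic, not an API provided by the repository) is to track what fraction of coordinates hit the clipping threshold: if rho is far too small nearly every coordinate clips and the optimizer degenerates into sign-style steps, while a very large rho rarely clips and the update is dominated by the raw preconditioned momentum. The state keys 'exp_avg' and 'hessian' below are assumptions about SophiaG's internal buffers:

```python
import torch

@torch.no_grad()
def clipped_fraction(optimizer, rho=0.01, bs=480, eps=1e-15):
    """Hypothetical diagnostic: fraction of coordinates whose update was clipped.

    Assumes the optimizer stores its momentum under 'exp_avg' and its diagonal
    Hessian estimate under 'hessian' in optimizer.state (matching the sketch above).
    """
    clipped, total = 0, 0
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg" not in state or "hessian" not in state:
                continue
            m, h = state["exp_avg"], state["hessian"]
            clipped += (m.abs() >= rho * bs * h + eps).sum().item()
            total += m.numel()
    return clipped / max(total, 1)
```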

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Victor Taelin (Author of Bend, Kind, HVM), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

nanoT5 by PiotrNawrot

0.2%
1k
PyTorch code for T5 pre-training and fine-tuning on a single GPU
Created 2 years ago
Updated 1 year ago
Starred by Benjamin Bolte (Cofounder of K-Scale Labs), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 2 more.

Muon by KellerJordan

1.7%
2k
Optimizer for neural network hidden layers
Created 10 months ago
Updated 2 months ago