Sophia by Liuhong99

Optimizer for language model pre-training (research paper)

created 2 years ago
965 stars

Top 39.0% on sourcepulse

View on GitHub: https://github.com/Liuhong99/Sophia
Project Summary

This repository provides the official implementation of SophiaG, a scalable second-order optimizer designed for efficient language model pre-training. It aims to accelerate training by leveraging second-order information (Hessian approximations) while maintaining scalability, offering a faster alternative to first-order optimizers like AdamW and Lion.

How It Works

SophiaG maintains an exponential moving average of the gradients (momentum) alongside an exponential moving average of a cheap diagonal Hessian estimate, which is refreshed only every few steps via update_hessian() using a Gauss-Newton-Bartlett estimator (gradients computed on labels sampled from the model's own output distribution). Each step divides the momentum element-wise by the Hessian estimate, scaled by the rho hyperparameter and the batch size, and clips the result to magnitude at most 1, so no coordinate moves by more than the learning rate. This clipping stabilizes training when the curvature estimate is noisy or near zero, which is what allows comparatively large learning rates without divergence.
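
A minimal per-tensor sketch of this update rule, assuming squared gradients as the stand-in Hessian estimate and illustrative hyperparameter values (the names, defaults, and eps here are assumptions for exposition, not the repository's exact code):

```python
import torch

def sophia_like_update(param, grad, state, lr=1e-4, beta1=0.965, beta2=0.99,
                       rho=0.04, bs=480, eps=1e-15, refresh_hessian=False):
    """Simplified, per-tensor sketch of a Sophia-style update (illustrative only)."""
    m = state.setdefault("exp_avg", torch.zeros_like(param))  # EMA of gradients
    h = state.setdefault("hessian", torch.zeros_like(param))  # EMA of diagonal Hessian estimate

    m.mul_(beta1).add_(grad, alpha=1 - beta1)

    if refresh_hessian:  # in practice only every k steps, via update_hessian()
        # Squared gradients stand in here for the Gauss-Newton-Bartlett estimate.
        h.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Precondition the momentum by the scaled Hessian estimate, then clip each
    # coordinate so no parameter moves by more than lr in a single step.
    ratio = (m.abs() / (rho * bs * h + eps)).clamp_(max=1.0)
    param.add_(m.sign() * ratio, alpha=-lr)
    return param
```

The clamp is the key design choice: whenever the preconditioned step would exceed 1, the update falls back to a sign-momentum step of size lr, so a poor or near-zero curvature estimate cannot blow up any single coordinate.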

Quick Start & Requirements

  • Install: clone the repository and import SophiaG from sophia.py (the README does not reference an official PyPI package).
  • Prerequisites: PyTorch 2.1.2, transformers 4.33.0, datasets, tiktoken, wandb.
  • Usage: Import SophiaG and use it in place of other optimizers, calling optimizer.step(bs=...) each iteration and optimizer.update_hessian() periodically (see the sketch after this list).
  • Docs: https://github.com/Liuhong99/Sophia
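
As a hedged illustration of that usage pattern, the toy loop below wires SophiaG into a training step. The model, data, hyperparameter values, and the k=10 Hessian-refresh interval are assumptions for exposition, not the repository's reference configuration:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG  # sophia.py from the cloned repository

# Toy stand-in for a language-model head so the sketch is self-contained.
# All hyperparameter values below are illustrative, not the repo's configs.
vocab, dim, batch = 100, 32, 16
model = torch.nn.Linear(dim, vocab)
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.05, weight_decay=0.1)
k = 10  # assumed interval for refreshing the Hessian estimate

for step in range(100):
    x = torch.randn(batch, dim)
    y = torch.randint(0, vocab, (batch,))

    # Regular forward/backward pass on the real labels.
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step(bs=batch)  # bs: examples (tokens) contributing to this step
    optimizer.zero_grad(set_to_none=True)

    # Every k steps, refresh the diagonal Hessian estimate with an extra
    # backward pass on labels sampled from the model's own logits
    # (Gauss-Newton-Bartlett style), then call update_hessian().
    if step % k == k - 1:
        logits = model(x)
        y_sampled = torch.distributions.Categorical(logits=logits).sample()
        F.cross_entropy(logits, y_sampled).backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```

For reproducing the GPT-2 results, rely on the training scripts and configs shipped in the repository rather than this sketch.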

Highlighted Details

  • Claims significantly faster convergence than AdamW and Lion for language model pre-training; the paper reports roughly a 2x speed-up over AdamW in steps, total compute, and wall-clock time.
  • Provides detailed hyperparameter tuning guidance and example configurations for reproducing GPT-2 results.
  • Supports distributed training via torchrun.
  • Requires periodic calls to update_hessian() to maintain Hessian approximations.

Maintenance & Community

  • The project is associated with the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training".
  • Codebase is based on nanoGPT and levanter.
  • Encourages community feedback on hyperparameter tuning.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing before commercial use.

Limitations & Caveats

  • Reproducing results for the 1.5B model requires TPU instances and specific setup via levanter.
  • Hyperparameter tuning, particularly rho and learning rate, is crucial for optimal performance and stability.
  • The update_hessian() call needs to be integrated into the training loop, adding slight complexity compared to first-order optimizers.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
