grokfast by ironjr

Research code for accelerating grokking via gradient amplification

created 1 year ago
559 stars

Top 58.3% on sourcepulse

View on GitHub
Project Summary

Grokfast accelerates the "grokking" phenomenon in machine learning, where models exhibit delayed generalization after overfitting. This project offers a simple, drop-in solution for practitioners seeking to speed up this process across diverse tasks like image, language, and graph modeling.

How It Works

Grokfast operates by spectrally decomposing parameter gradients, viewed as time series over training steps, into fast-varying and slow-varying components. It then amplifies the slow-varying components, which are hypothesized to drive generalization. This is achieved by integrating custom gradient filtering functions (EMA or MA) directly into the optimization loop, modifying gradients before the optimizer step. The approach aims to hasten the transition from overfitting to generalization without altering the model architecture or the rest of the training process.
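The following is a minimal, self-contained sketch of an EMA-style filter of this kind, assuming a PyTorch model. It is illustrative rather than the repository's grokfast.py: the function name, buffer layout, and default constants here are invented for the example.

```python
import torch

@torch.no_grad()
def ema_gradient_filter(model, ema_grads, alpha=0.98, lamb=2.0):
    """Amplify the slow-varying gradient component via an EMA low-pass filter.

    ema_grads: dict mapping parameter name -> EMA buffer (pass None on the first call).
    alpha: EMA decay; values closer to 1.0 correspond to a lower cutoff frequency.
    lamb: amplification factor for the slow (low-pass) component.
    """
    if ema_grads is None:
        ema_grads = {n: p.grad.detach().clone()
                     for n, p in model.named_parameters() if p.grad is not None}
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        # Low-pass filter the per-parameter gradient time series ...
        ema_grads[n].mul_(alpha).add_(p.grad, alpha=1.0 - alpha)
        # ... and add the amplified slow component back onto the raw gradient.
        p.grad.add_(ema_grads[n], alpha=lamb)
    return ema_grads
```

In this framing, alpha plays the role of the cutoff frequency the project's tuning guidance refers to, and lamb controls how strongly the slow component is amplified.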

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires PyTorch.
  • Reproduction of experiments requires additional packages listed in requirements.txt.
  • Setup for basic usage involves downloading grokfast.py and importing its functions (see the usage sketch below).
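Assuming grokfast.py sits next to the training script, integration might look like the sketch below. The keyword arguments alpha and lamb are assumptions standing in for the filter's actual hyperparameters (the cutoff-frequency tuning mentioned above), not a confirmed signature.

```python
import torch
from grokfast import gradfilter_ema  # grokfast.py downloaded next to this script

# Toy model and data stand in for a real task.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,)))]  # placeholder batch

grads = None  # filter state carried across steps
for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Filter/amplify gradients in place before the optimizer step.
    # NOTE: the keyword names `alpha` and `lamb` are assumptions, not confirmed API.
    grads = gradfilter_ema(model, grads=grads, alpha=0.98, lamb=2.0)
    optimizer.step()
```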

Highlighted Details

  • Achieves over 50x acceleration in reaching generalization milestones in experiments.
  • Demonstrates effectiveness across Transformer decoders, MLPs, LSTMs, and G-CNNs on various datasets.
  • Offers two filtering methods: gradfilter_ema (Exponential Moving Average) and gradfilter_ma (Moving Average).
  • Provides guidance on hyperparameter tuning for cutoff frequencies and weight decay.

Maintenance & Community

  • Project initiated by researchers from Seoul National University.
  • Code is based on several prior grokking research projects.
  • Contact email provided for questions.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided hyperparameter recommendations are based on experimental experience and may require further tuning for optimal performance on new tasks. The gradfilter_ma function's additional memory requirements increase linearly with window_size, since the filter must buffer that many past gradients per parameter (see the sketch below).
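To make the memory caveat concrete, here is a minimal sketch of a moving-average filter of this kind; it is illustrative, not the repository's gradfilter_ma, and the names and defaults are invented. It retains the last window_size gradient tensors for every parameter, which is where the linear memory growth comes from.

```python
from collections import deque
import torch

@torch.no_grad()
def ma_gradient_filter(model, grad_window, window_size=100, lamb=5.0):
    """Moving-average variant: keeps the last `window_size` gradients per
    parameter, so extra memory grows linearly with window_size."""
    if grad_window is None:
        grad_window = {n: deque(maxlen=window_size)
                       for n, p in model.named_parameters() if p.requires_grad}
    for n, p in model.named_parameters():
        if p.grad is None:
            continue
        grad_window[n].append(p.grad.detach().clone())  # one stored copy per step
        if len(grad_window[n]) == window_size:  # warm up until a full window exists
            slow = torch.stack(list(grad_window[n])).mean(dim=0)
            p.grad.add_(slow, alpha=lamb)
    return grad_window
```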

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 11 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

applied-ai by pytorch-labs

Applied AI experiments and examples for PyTorch

created 2 years ago
updated 2 months ago
289 stars

Top 0.3% on sourcepulse