A PyTorch implementation of the Adam-mini optimizer from the paper "Adam-mini: Use Fewer Learning Rates To Gain More"
Adam-mini is a PyTorch optimizer designed to cut optimizer-state memory by roughly 50% relative to AdamW while matching or improving its performance. It achieves this by partitioning model parameters into blocks and assigning a single learning rate to each block, sharply reducing the number of distinct learning rates the optimizer must track. This is particularly beneficial for large language models and other deep learning architectures where memory constraints are significant.
How It Works
Adam-mini takes a novel approach to assigning learning rates: it partitions parameters into blocks guided by the model's Hessian structure. Instead of maintaining an individual adaptive learning rate for every parameter, as Adam does, it assigns a single learning rate to each block. This strategy, detailed in Algorithm 1 of the associated paper, significantly reduces the memory needed for optimizer state (specifically the second-moment estimate $v$ behind Adam's per-parameter $1/\sqrt{v}$ scaling). The partitioning is designed to be effective across model architectures, especially Transformers, by taking dimensions such as the hidden size and the number of attention heads into account.
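As a rough illustration of the mechanism, the toy sketch below (not the project's code; the function name and one-tensor-per-block grouping are invented for illustration) applies an Adam-style update in which each parameter block shares a single second-moment scalar, and hence a single effective learning rate.

import torch

def adam_mini_style_step(blocks, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, step=1):
    # Toy sketch of a block-wise Adam-style update; NOT the official Adam-mini code.
    # Each tensor in `blocks` is treated as one block sharing one second-moment scalar.
    beta1, beta2 = betas
    for i, p in enumerate(blocks):
        if p.grad is None:
            continue
        g = p.grad
        m, v = state.setdefault(i, (torch.zeros_like(g), torch.zeros(())))
        m = beta1 * m + (1 - beta1) * g                 # per-parameter first moment, as in Adam
        v = beta2 * v + (1 - beta2) * g.pow(2).mean()   # one scalar per block: mean squared gradient
        state[i] = (m, v)
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        # A single learning rate lr / (sqrt(v_hat) + eps) is applied to the whole block.
        p.data.add_(m_hat, alpha=-lr / (v_hat.sqrt().item() + eps))

In the actual optimizer, blocks are built from parameter names and shapes (for example per attention head) following Algorithm 1 in the paper, rather than one tensor per block as in this sketch.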
Quick Start & Requirements
Install from PyPI:
pip install adam-mini
Or install from source:
git clone https://github.com/zyushun/Adam-mini && cd Adam-mini && pip install -e .
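A minimal usage sketch, assuming the interface documented in the repository (class Adam_mini in the adam_mini package, taking named_parameters, lr, betas, weight_decay, dim, n_heads, and n_kv_heads); argument names and defaults may differ across versions, so treat this as illustrative rather than authoritative.

import torch
from adam_mini import Adam_mini  # package/class names assumed from the install command above

model = torch.nn.Linear(128, 128)  # stand-in for your Transformer

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.1,
    # Transformer dimensions; recommended so parameters can be partitioned per attention head.
    dim=128,
    n_heads=4,
    n_kv_heads=4,
)

for _ in range(5):
    loss = model(torch.randn(8, 128)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()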
When constructing the optimizer for a Transformer model, passing the dim, n_heads, and n_kv_heads parameters (as in the sketch above) is recommended.
Highlighted Details
Setting optimizer.wv_names = {} is recommended for faster initial convergence (see the snippet below).
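A hedged placement sketch: the attribute name is taken from the recommendation above, while its exact effect and the choice to set it immediately after construction are assumptions.

import torch
from adam_mini import Adam_mini  # as in the quick-start sketch; names assumed

model = torch.nn.Linear(16, 16)  # stand-in model
optimizer = Adam_mini(named_parameters=model.named_parameters(), lr=1e-3)
optimizer.wv_names = {}  # set right after construction, before the first optimizer.step()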
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model_sharding argument is deprecated and will be removed in future versions; model parallelism will be assumed to always be in use.