Adam-mini by zyushun

PyTorch implementation of the Adam-mini optimizer from the accompanying research paper

created 1 year ago
430 stars

Top 70.1% on sourcepulse

Project Summary

Adam-mini is a PyTorch optimizer designed to cut optimizer memory by roughly 50% compared to AdamW while matching or exceeding its performance. It achieves this by partitioning model parameters into blocks and assigning a single learning rate per block, sharply reducing the number of distinct learning rates the optimizer must store. This is particularly beneficial for large language models and other deep learning architectures where memory is a significant constraint.

How It Works

Adam-mini departs from Adam's per-parameter learning rates by partitioning parameters into blocks according to Hessian-structure principles. Instead of maintaining an individual learning rate for each parameter, it assigns a single learning rate to each block. This strategy, detailed in Algorithm 1 of the associated paper, significantly reduces the memory needed for optimizer state (specifically the $1/\sqrt{v}$ term in Adam). The partitioning is designed to be effective across model architectures, especially Transformers, by respecting dimensions such as hidden features and attention heads.
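
For intuition, here is a minimal sketch of the block-wise idea: an Adam-style update in which the second-moment term v is collapsed to a single scalar per block (here, each parameter tensor is treated as one block). The function is a hypothetical toy, not the repository's Algorithm 1, which partitions Transformer weights more finely (e.g., per attention head).

    import torch

    def blockwise_adam_step(params, state, lr=1e-3, beta1=0.9, beta2=0.999,
                            eps=1e-8, step=1):
        """Toy illustration of Adam-mini's idea: keep Adam's element-wise
        first moment, but store only ONE scalar second moment per block."""
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad
            st = state.setdefault(i, {"m": torch.zeros_like(p), "v": 0.0})
            # First moment m: element-wise, exactly as in Adam/AdamW.
            st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
            # Second moment v: a single scalar per block, updated with the
            # mean of g^2 over the block -- this is where state memory is saved.
            st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean().item()
            m_hat = st["m"] / (1 - beta1 ** step)
            v_hat = st["v"] / (1 - beta2 ** step)
            p.data.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))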

Quick Start & Requirements

  • Install via pip: pip install adam-mini
  • Alternatively, install from source: git clone https://github.com/zyushun/Adam-mini && cd Adam-mini && pip install -e .
  • Requires PyTorch version >= 2.1.0.
  • For Transformer models, passing the dim, n_heads, and n_kv_heads arguments is recommended (see the setup sketch after this list).
  • Official documentation and examples are available on the GitHub repository.
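
A minimal setup sketch, following the pattern in the repository's README; the model, hyperparameter values, and Transformer dimensions below are placeholders, and the exact import path and constructor arguments may differ between versions:

    import torch.nn as nn
    from adam_mini import Adam_mini  # available after `pip install adam-mini`

    model = nn.Linear(512, 512)  # placeholder model; use your own network

    # dim / n_heads / n_kv_heads describe the Transformer layout so Adam-mini
    # can partition attention weights per head; the values here are placeholders.
    optimizer = Adam_mini(
        named_parameters=model.named_parameters(),
        lr=1e-3,
        betas=(0.9, 0.95),
        weight_decay=0.1,
        dim=512,
        n_heads=8,
        n_kv_heads=8,
    )

    # Then use it like any PyTorch optimizer:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()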

Highlighted Details

  • Achieves on-par or better performance than AdamW with 50% less memory.
  • Supports popular distributed frameworks: DDP, FSDP, DeepSpeed, Huggingface Trainer, Torchtitan, LLaMA-Factory.
  • Provides example code for pre-training GPT-2 and the Llama series, and for SFT/RLHF with Llama2-7B.
  • For runs with a small total number of training steps (roughly under 10k-20k), setting optimizer.wv_names = {} is recommended for faster initial convergence (see the snippet after this list).
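
Assuming the optimizer object from the setup sketch above, the recommended override is a one-liner applied before training:

    # For short runs (< ~10k-20k total steps), the project recommends clearing
    # wv_names after constructing the optimizer and before the training loop:
    optimizer.wv_names = {}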

Maintenance & Community

  • Active development with recent updates including pip installation support, LLaMA-Factory integration, and FSDP CPU-offload.
  • Community support channels are not explicitly mentioned, but contributions are acknowledged.
  • Roadmap details are not provided.

Licensing & Compatibility

  • The repository does not include an explicit license. Without one, default copyright applies and no reuse rights are granted automatically; verify terms with the authors before commercial use.

Limitations & Caveats

  • There are known issues with checkpoint saving under FSDP, which the developers are actively working on.
  • CPU offload is not supported with DeepSpeed; disable it when using DeepSpeed (see the config sketch after this list).
  • The model_sharding argument is deprecated and will be removed in a future version, after which model parallelism will be assumed to be always in use.
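
As a sketch of the relevant DeepSpeed setting, assuming a standard ZeRO configuration (field names follow DeepSpeed's documented config schema; the stage and batch size are illustrative):

    # Contents normally live in ds_config.json; shown here as a Python dict.
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,   # illustrative value
        "zero_optimization": {
            "stage": 2,                        # illustrative value
            # Leave optimizer CPU offload disabled (or omit this block),
            # since Adam-mini does not support CPU offload under DeepSpeed.
            "offload_optimizer": {"device": "none"},
        },
    }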

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 90 days
