Adam-mini by zyushun

PyTorch implementation of the Adam-mini optimizer from the accompanying research paper

created 1 year ago
430 stars

Top 70.1% on sourcepulse

Project Summary

Adam-mini is a PyTorch optimizer designed to cut optimizer memory by roughly 50% compared to AdamW while matching or exceeding its performance. It achieves this by partitioning model parameters into blocks and assigning a single learning rate per block, sharply reducing the number of distinct learning rates the optimizer must store. This is particularly beneficial for large language models and other deep learning architectures where memory is a significant constraint.

How It Works

Adam-mini departs from Adam's per-parameter learning rates by partitioning parameters into blocks according to Hessian-structure principles. Instead of maintaining an individual learning rate for each parameter, it assigns a single learning rate to each block. This strategy, detailed in Algorithm 1 of the associated paper, significantly reduces the memory needed for optimizer state (specifically the $1/\sqrt{v}$ term in Adam). The partitioning is designed to be effective across model architectures, especially Transformers, by respecting dimensions such as hidden features and attention heads.
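
For intuition, here is a minimal sketch of the block-wise idea: an Adam-style update in which the second-moment term v is collapsed to a single scalar per block (here, each parameter tensor is treated as one block). The function is a hypothetical toy, not the repository's Algorithm 1, which partitions Transformer weights more finely (e.g., per attention head).

    import torch

    def blockwise_adam_step(params, state, lr=1e-3, beta1=0.9, beta2=0.999,
                            eps=1e-8, step=1):
        """Toy illustration of Adam-mini's idea: keep Adam's element-wise
        first moment, but store only ONE scalar second moment per block."""
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad
            st = state.setdefault(i, {"m": torch.zeros_like(p), "v": 0.0})
            # First moment m: element-wise, exactly as in Adam/AdamW.
            st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
            # Second moment v: a single scalar per block, updated with the
            # mean of g^2 over the block -- this is where state memory is saved.
            st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean().item()
            m_hat = st["m"] / (1 - beta1 ** step)
            v_hat = st["v"] / (1 - beta2 ** step)
            p.data.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))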

Quick Start & Requirements

  • Install via pip: pip install adam-mini
  • Alternatively, install from source: git clone https://github.com/zyushun/Adam-mini && cd Adam-mini && pip install -e .
  • Requires PyTorch version >= 2.1.0.
  • For Transformer models, passing the dim, n_heads, and n_kv_heads arguments is recommended (see the setup sketch after this list).
  • Official documentation and examples are available on the GitHub repository.
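
A minimal setup sketch, following the pattern in the repository's README; the model, hyperparameter values, and Transformer dimensions below are placeholders, and the exact import path and constructor arguments may differ between versions:

    import torch.nn as nn
    from adam_mini import Adam_mini  # available after `pip install adam-mini`

    model = nn.Linear(512, 512)  # placeholder model; use your own network

    # dim / n_heads / n_kv_heads describe the Transformer layout so Adam-mini
    # can partition attention weights per head; the values here are placeholders.
    optimizer = Adam_mini(
        named_parameters=model.named_parameters(),
        lr=1e-3,
        betas=(0.9, 0.95),
        weight_decay=0.1,
        dim=512,
        n_heads=8,
        n_kv_heads=8,
    )

    # Then use it like any PyTorch optimizer:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()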

Highlighted Details

  • Achieves on-par or better performance than AdamW with 50% less memory.
  • Supports popular distributed frameworks: DDP, FSDP, DeepSpeed, Huggingface Trainer, Torchtitan, LLaMA-Factory.
  • Provides example code for pre-training GPT-2 and the Llama series, and for SFT/RLHF with Llama2-7B.
  • For runs with a small total number of training steps (roughly under 10k-20k), setting optimizer.wv_names = {} is recommended for faster initial convergence (see the snippet after this list).
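
Assuming the optimizer object from the setup sketch above, the recommended override is a one-liner applied before training:

    # For short runs (< ~10k-20k total steps), the project recommends clearing
    # wv_names after constructing the optimizer and before the training loop:
    optimizer.wv_names = {}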

Maintenance & Community

  • Active development with recent updates including pip installation support, LLaMA-Factory integration, and FSDP CPU-offload.
  • Community support channels are not explicitly mentioned, but contributions are acknowledged.
  • Roadmap details are not provided.

Licensing & Compatibility

  • The repository does not include an explicit license. Without one, default copyright applies and no reuse rights are granted automatically; verify terms with the authors before commercial use.

Limitations & Caveats

  • There are known issues with checkpoint saving under FSDP, which the developers are actively working on.
  • CPU offload is not supported with DeepSpeed; disable it when using DeepSpeed (see the config sketch after this list).
  • The model_sharding argument is deprecated and will be removed in a future version, after which model parallelism will be assumed to be always in use.
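
As a sketch of the relevant DeepSpeed setting, assuming a standard ZeRO configuration (field names follow DeepSpeed's documented config schema; the stage and batch size are illustrative):

    # Contents normally live in ds_config.json; shown here as a Python dict.
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,   # illustrative value
        "zero_optimization": {
            "stage": 2,                        # illustrative value
            # Leave optimizer CPU offload disabled (or omit this block),
            # since Adam-mini does not support CPU offload under DeepSpeed.
            "offload_optimizer": {"device": "none"},
        },
    }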

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 90 days
