Adam-mini by zyushun

PyTorch implementation of the Adam-mini optimizer from a research paper

Created 1 year ago
436 stars

Top 68.4% on SourcePulse

Project Summary

Adam-mini is a PyTorch optimizer designed to cut optimizer memory by roughly 50% compared to AdamW while matching or exceeding its performance. It does so by partitioning model parameters into blocks and assigning a single learning rate to each block, so the optimizer tracks far fewer distinct learning rates. This is particularly beneficial for large language models and other deep learning architectures where memory constraints are significant.

How It Works

Adam-mini replaces Adam's per-parameter adaptive learning rates with block-wise ones. It partitions model parameters into blocks, guided by the Hessian structure of the model, and assigns a single learning rate to each block rather than maintaining one per parameter. This strategy, detailed in Algorithm 1 of the associated paper, sharply reduces the memory needed for optimizer state: the second-moment estimate $v$, from which Adam's effective step size $1/\sqrt{v}$ is derived, shrinks from one value per parameter to one value per block. The partitioning is designed to be effective across model architectures, and for Transformers it takes dimensions such as the hidden size and the number of attention heads into account.
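
The core idea can be sketched in a few lines of PyTorch. The function below is an illustrative, hypothetical single-block update, not the repository's implementation: the first moment stays per-element as in Adam, while the second moment collapses to one scalar per block, so every element of the block shares a single learning rate. Bias correction, weight decay, and the Transformer-aware per-head partitioning are omitted.

```python
import torch

def block_step(param, grad, state, lr=1e-3, betas=(0.9, 0.95), eps=1e-8):
    """Illustrative Adam-mini-style update for ONE parameter block (sketch only).

    Per-element first moment as in Adam, but a single scalar second moment
    shared by the whole block, so the block uses one learning rate.
    """
    beta1, beta2 = betas
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad      # per-element momentum
    v_block = grad.pow(2).mean()                               # one scalar for the block
    state["v"] = beta2 * state["v"] + (1 - beta2) * v_block    # scalar second moment
    param.data.add_(state["m"] / (state["v"].sqrt() + eps), alpha=-lr)

# Toy usage on a single weight block.
w = torch.nn.Parameter(torch.randn(4, 4))
w.grad = torch.randn(4, 4)
state = {"m": torch.zeros_like(w), "v": torch.zeros(())}
block_step(w, w.grad, state)
```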

Quick Start & Requirements

  • Install via pip: pip install adam-mini
  • Alternatively, install from source: git clone https://github.com/zyushun/Adam-mini && cd Adam-mini && pip install -e .
  • Requires PyTorch version >= 2.1.0.
  • For Transformer models, passing the dim, n_heads, and n_kv_heads arguments is recommended (see the usage sketch after this list).
  • Official documentation and examples are available on the GitHub repository.
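
A minimal usage sketch is shown below. The constructor arguments follow the repository README, but treat the exact signature and the toy stand-in model as assumptions and check the official examples before relying on them.

```python
import torch
from adam_mini import Adam_mini  # pip install adam-mini

# Toy model standing in for a Transformer; dim / n_heads / n_kv_heads tell the
# optimizer how to partition attention weights (pass your model's real values,
# or omit them for non-Transformer models).
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    dim=64,
    n_heads=4,
    n_kv_heads=4,
)

# One training step on random data.
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```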

Highlighted Details

  • Achieves on-par or better performance than AdamW with 50% less memory.
  • Supports popular distributed frameworks: DDP, FSDP, DeepSpeed, Huggingface Trainer, Torchtitan, LLaMA-Factory.
  • Provides example code for pre-training GPT-2 and Llama-series models, and for SFT/RLHF with Llama2-7B.
  • For runs with a small total number of training steps (roughly under 10k-20k), setting optimizer.wv_names = {} is recommended for faster initial convergence (see the snippet after this list).
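
That last tweak is a one-line attribute change after constructing the optimizer (continuing from the usage sketch above; the attribute name is taken from the bullet, so verify it against the repository docs):

```python
# Recommended by the maintainers for runs with few total training steps
# (roughly under 10k-20k), to speed up initial convergence.
optimizer.wv_names = {}
```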

Maintenance & Community

  • Active development with recent updates including pip installation support, LLaMA-Factory integration, and FSDP CPU-offload.
  • Community support channels are not explicitly mentioned, but contributions are acknowledged.
  • Roadmap details are not provided.

Licensing & Compatibility

  • The repository does not explicitly state a license. Based on common practice for research codebases and the lack of explicit restrictions, it is likely intended for research use. Commercial use should be verified.

Limitations & Caveats

  • There are known issues with checkpoint saving under FSDP, which the developers are actively working on.
  • CPU offload is not supported with DeepSpeed and must be disabled when that framework is used.
  • The model_sharding argument is deprecated and will be removed in a future version; model parallelism will then be assumed to always be in use.
Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

fms-fsdp by foundation-model-stack

0.4%
265
Efficiently train foundation models with PyTorch
Created 1 year ago
Updated 1 month ago
Starred by Victor Taelin (Author of Bend, Kind, HVM), Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), and 2 more.

nanoT5 by PiotrNawrot

0.2%
1k
PyTorch code for T5 pre-training and fine-tuning on a single GPU
Created 2 years ago
Updated 1 year ago
Starred by Benjamin Bolte (Cofounder of K-Scale Labs), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 2 more.

Muon by KellerJordan

1.7%
2k
Optimizer for neural network hidden layers
Created 10 months ago
Updated 2 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Pawel Garbacki (Cofounder of Fireworks AI), and 11 more.

Liger-Kernel by linkedin

0.6%
6k
Triton kernels for efficient LLM training
Created 1 year ago
Updated 1 day ago