Adan by sail-sg

PyTorch implementation of Adan optimizer for faster deep model training

created 2 years ago
797 stars

Top 45.1% on sourcepulse

View on GitHub
Project Summary

Adan is an adaptive Nesterov momentum algorithm designed to accelerate deep model optimization. It targets researchers and practitioners in deep learning, offering faster convergence and potentially better performance than standard optimizers like AdamW, particularly for large models and datasets.

How It Works

Adan reformulates Nesterov momentum so that the acceleration term is estimated from gradient differences rather than from a gradient evaluated at an extrapolated point. The optimizer keeps exponential moving averages of the gradient, the gradient difference, and the squared corrected gradient, governed by three coefficients ($\beta_1$, $\beta_2$, $\beta_3$). This construction lets Adan use significantly higher peak learning rates than Adam or AdamW while remaining relatively robust to the choice of the $\beta$ coefficients, which is where much of its speedup comes from.
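A rough sketch of one Adan step, following Algorithm 1 of the Adan paper; the coefficient conventions and defaults in the PyTorch implementation may differ slightly:

```latex
% One Adan step for parameters \theta_k with gradient g_k, learning rate \eta,
% and weight decay \lambda (paper notation; illustrative sketch only).
\begin{aligned}
m_k &= (1-\beta_1)\, m_{k-1} + \beta_1\, g_k
  && \text{EMA of gradients} \\
v_k &= (1-\beta_2)\, v_{k-1} + \beta_2\, (g_k - g_{k-1})
  && \text{EMA of gradient differences} \\
n_k &= (1-\beta_3)\, n_{k-1} + \beta_3\, \bigl[g_k + (1-\beta_2)(g_k - g_{k-1})\bigr]^2
  && \text{EMA of squared corrected gradients} \\
\eta_k &= \eta \,/\, \bigl(\sqrt{n_k} + \varepsilon\bigr)
  && \text{per-coordinate step size} \\
\theta_{k+1} &= (1 + \lambda \eta)^{-1}\bigl[\theta_k - \eta_k \circ \bigl(m_k + (1-\beta_2)\, v_k\bigr)\bigr]
  && \text{step with proximal weight decay}
\end{aligned}
```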

Quick Start & Requirements

  • Install via pip: python3 -m pip install git+https://github.com/sail-sg/Adan.git
  • FusedAdan is installed by default. To install the original, unfused Adan: clone the repo, cd Adan, then run python3 setup.py install --unfused.
  • Requires PyTorch; a minimal usage sketch follows this list.
  • See the repository for detailed training recipes for ViTs, ResNets, ConvNeXt, MAE, BERT, Transformer-XL, and GPT-2.
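A minimal usage sketch, assuming the package is importable as from adan import Adan and that the constructor accepts the hyperparameters named in the README (betas, weight_decay, max_grad_norm, no_prox); verify the exact names against the source:

```python
import torch
import torch.nn.functional as F
from adan import Adan  # assumed import path; check the repository

model = torch.nn.Linear(128, 10)

# Values below are illustrative, not the library's documented defaults.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,                   # Adan tolerates larger peak learning rates than AdamW
    betas=(0.98, 0.92, 0.99),  # three momentum coefficients (beta1, beta2, beta3)
    weight_decay=0.02,
    max_grad_norm=0.0,         # 0 disables the built-in gradient clipping
)

# Standard PyTorch training step.
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```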

Highlighted Details

  • Achieves comparable or better results than AdamW on LLMs (MoE, GPT-2) and vision tasks (ViT, ResNet, ConvNeXt), often in fewer training steps.
  • The FusedAdan variant offers a smaller memory footprint and faster execution, especially on larger models.
  • Supports gradient clipping (max_grad_norm) and two weight decay implementations selected by the no_prox flag (sketched after this list).
  • Integrates with popular frameworks such as NVIDIA NeMo, Hugging Face Timm, and OpenMMLab MMClassification.
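A simplified sketch of the two weight-decay modes selected by the no_prox flag, following the README's description; here update stands in for the adaptive step built from the moment estimates, and this is not the library's actual code:

```python
def apply_weight_decay(param, update, lr, wd, no_prox=False):
    """Illustrative per-parameter view of Adan's two weight-decay modes."""
    if no_prox:
        # AdamW-style decoupled decay: shrink the weights, then take the step.
        return param * (1 - lr * wd) - lr * update
    # Default proximal update: take the step, then rescale by 1 / (1 + lr * wd).
    return (param - lr * update) / (1 + lr * wd)
```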

Maintenance & Community

  • Supported in projects from NVIDIA (NeMo), Meta AI (D-Adaptation), and Baidu (Paddle).
  • Integrated into projects like Consistent3D, MDT V2, and DreamFusion.
  • Active development with releases for LLMs and fused implementations.

Licensing & Compatibility

  • The README does not state a license; check the repository for a LICENSE file and confirm the terms before commercial use.

Limitations & Caveats

  • Adan has a slightly higher GPU memory cost than Adam/AdamW on a single node, though this can be mitigated with distributed training strategies such as ZeroRedundancyOptimizer (sketched after this list).
  • The README mentions a "restart strategy" that can further improve performance but is not used in most experiments.
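One way to offset the extra optimizer state in multi-GPU training is to shard it with PyTorch's ZeroRedundancyOptimizer. A sketch follows; the Adan import path is assumed as above, and distributed initialization (e.g. via torchrun) is omitted:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP
from adan import Adan  # assumed import path; check the repository

# Assumes torch.distributed.init_process_group(...) has already run on each rank.
model = DDP(torch.nn.Linear(128, 10).cuda())

# Each rank keeps only a shard of Adan's per-parameter state (the three EMAs),
# lowering the per-GPU memory overhead compared with replicating it everywhere.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=Adan,
    lr=1e-3,
    weight_decay=0.02,
)
```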
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

10 stars in the last 90 days

Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

Explore Similar Projects

DeepSpeed by deepspeedai
Top 0.2%, 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago