Adan by sail-sg

PyTorch implementation of Adan optimizer for faster deep model training

created 2 years ago
797 stars

Top 45.1% on sourcepulse

View on GitHub
Project Summary

Adan is an adaptive Nesterov momentum algorithm designed to accelerate deep model optimization. It targets researchers and practitioners in deep learning, offering faster convergence and potentially better performance than standard optimizers like AdamW, particularly for large models and datasets.

How It Works

Adan reformulates Nesterov momentum so that the acceleration term is estimated from gradient differences rather than from a gradient evaluated at an extrapolated point. The optimizer keeps exponential moving averages of the gradient, the gradient difference, and the squared corrected gradient, governed by three coefficients ($\beta_1$, $\beta_2$, $\beta_3$). This construction lets Adan use significantly higher peak learning rates than Adam or AdamW while remaining relatively robust to the choice of the $\beta$ coefficients, which is where much of its speedup comes from.
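A rough sketch of one Adan step, following Algorithm 1 of the Adan paper; the coefficient conventions and defaults in the PyTorch implementation may differ slightly:

```latex
% One Adan step for parameters \theta_k with gradient g_k, learning rate \eta,
% and weight decay \lambda (paper notation; illustrative sketch only).
\begin{aligned}
m_k &= (1-\beta_1)\, m_{k-1} + \beta_1\, g_k
  && \text{EMA of gradients} \\
v_k &= (1-\beta_2)\, v_{k-1} + \beta_2\, (g_k - g_{k-1})
  && \text{EMA of gradient differences} \\
n_k &= (1-\beta_3)\, n_{k-1} + \beta_3\, \bigl[g_k + (1-\beta_2)(g_k - g_{k-1})\bigr]^2
  && \text{EMA of squared corrected gradients} \\
\eta_k &= \eta \,/\, \bigl(\sqrt{n_k} + \varepsilon\bigr)
  && \text{per-coordinate step size} \\
\theta_{k+1} &= (1 + \lambda \eta)^{-1}\bigl[\theta_k - \eta_k \circ \bigl(m_k + (1-\beta_2)\, v_k\bigr)\bigr]
  && \text{step with proximal weight decay}
\end{aligned}
```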

Quick Start & Requirements

  • Install via pip: python3 -m pip install git+https://github.com/sail-sg/Adan.git
  • FusedAdan is installed by default. To install the original, unfused Adan: clone the repo, cd Adan, then run python3 setup.py install --unfused.
  • Requires PyTorch; a minimal usage sketch follows this list.
  • See the repository for detailed training recipes for ViTs, ResNets, ConvNeXt, MAE, BERT, Transformer-XL, and GPT-2.
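A minimal usage sketch, assuming the package is importable as from adan import Adan and that the constructor accepts the hyperparameters named in the README (betas, weight_decay, max_grad_norm, no_prox); verify the exact names against the source:

```python
import torch
import torch.nn.functional as F
from adan import Adan  # assumed import path; check the repository

model = torch.nn.Linear(128, 10)

# Values below are illustrative, not the library's documented defaults.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,                   # Adan tolerates larger peak learning rates than AdamW
    betas=(0.98, 0.92, 0.99),  # three momentum coefficients (beta1, beta2, beta3)
    weight_decay=0.02,
    max_grad_norm=0.0,         # 0 disables the built-in gradient clipping
)

# Standard PyTorch training step.
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```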

Highlighted Details

  • Achieves comparable or better results than AdamW on LLMs (MoE, GPT-2) and vision tasks (ViT, ResNet, ConvNeXt), often in fewer training steps.
  • The FusedAdan variant offers a smaller memory footprint and faster execution, especially on larger models.
  • Supports gradient clipping (max_grad_norm) and two weight decay implementations selected by the no_prox flag (sketched after this list).
  • Integrates with popular frameworks such as NVIDIA NeMo, Hugging Face Timm, and OpenMMLab MMClassification.
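A simplified sketch of the two weight-decay modes selected by the no_prox flag, following the README's description; here update stands in for the adaptive step built from the moment estimates, and this is not the library's actual code:

```python
def apply_weight_decay(param, update, lr, wd, no_prox=False):
    """Illustrative per-parameter view of Adan's two weight-decay modes."""
    if no_prox:
        # AdamW-style decoupled decay: shrink the weights, then take the step.
        return param * (1 - lr * wd) - lr * update
    # Default proximal update: take the step, then rescale by 1 / (1 + lr * wd).
    return (param - lr * update) / (1 + lr * wd)
```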

Maintenance & Community

  • Supported in projects from NVIDIA (NeMo), Meta AI (D-Adaptation), and Baidu (Paddle).
  • Integrated into projects like Consistent3D, MDT V2, and DreamFusion.
  • Active development with releases for LLMs and fused implementations.

Licensing & Compatibility

  • The README does not state a license; check the repository for a LICENSE file and confirm the terms before commercial use.

Limitations & Caveats

  • Adan has a slightly higher GPU memory cost than Adam/AdamW on a single node, though this can be mitigated with distributed training strategies such as ZeroRedundancyOptimizer (sketched after this list).
  • The README mentions a "restart strategy" that can further improve performance but is not used in most experiments.
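One way to offset the extra optimizer state in multi-GPU training is to shard it with PyTorch's ZeroRedundancyOptimizer. A sketch follows; the Adan import path is assumed as above, and distributed initialization (e.g. via torchrun) is omitted:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP
from adan import Adan  # assumed import path; check the repository

# Assumes torch.distributed.init_process_group(...) has already run on each rank.
model = DDP(torch.nn.Linear(128, 10).cuda())

# Each rank keeps only a shard of Adan's per-parameter state (the three EMAs),
# lowering the per-GPU memory overhead compared with replicating it everywhere.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=Adan,
    lr=1e-3,
    weight_decay=0.02,
)
```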
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

10 stars in the last 90 days

Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

Explore Similar Projects

DeepSpeed by deepspeedai
Top 0.2%, 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago