Optimization framework for training large models
MARS is a unified optimization framework designed to accelerate the training of large deep learning models by combining variance reduction techniques with preconditioned gradient methods. It targets researchers and engineers working with large language models and vision models, offering improved convergence and performance over standard optimizers like AdamW.
How It Works
MARS introduces a "scaled stochastic recursive momentum" to reduce gradient variance and a "preconditioned update" to approximate second-order methods. This dual approach aims to achieve better gradient complexity and per-iteration complexity, leading to faster convergence to critical points. It offers three instantiations: MARS-AdamW, MARS-Lion, and MARS-Shampoo, differing in their Hessian matrix approximations.
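For concreteness, here is a minimal NumPy sketch of a MARS-AdamW-style update following the description above: a variance-reduced (corrected) gradient is formed from the gradient change between consecutive iterates, clipped, and then fed through an AdamW-style preconditioned step. The default hyperparameter values, the clipping rule, and the scaling of the correction term are illustrative assumptions and may differ from the repository's implementation.

import numpy as np

def mars_adamw_step(x, grad_t, grad_prev, state, lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, weight_decay=0.0):
    # Variance-reduced gradient: current gradient plus a scaled correction
    # built from the gradient difference between consecutive iterates.
    c = grad_t + gamma * (beta1 / (1.0 - beta1)) * (grad_t - grad_prev)
    # Clip the corrected gradient to unit norm to keep the correction stable.
    c_norm = np.linalg.norm(c)
    if c_norm > 1.0:
        c = c / c_norm
    # AdamW-style preconditioned update applied to the corrected gradient.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * c
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * c * c
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)

# Toy usage on a 4-dimensional parameter vector with loss ||x||^2.
state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
x = np.ones(4)
x = mars_adamw_step(x, grad_t=2 * x, grad_prev=2 * x, state=state)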
Quick Start & Requirements
pip install torch==2.1.2 transformers==4.33.0 datasets tiktoken numpy==1.26.4 wandb
torchrun --standalone --nproc_per_node=8 MARS/train_mars.py config/${your_config_file}
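The training scripts above are config-driven. If the MARS optimizer is exposed as a standard torch.optim-style class, drop-in usage would look roughly like the sketch below; the import path, class name, and constructor arguments are assumptions for illustration, not the repository's confirmed API.

import torch
from mars import MARS  # hypothetical import path; check the repository's optimizer module

model = torch.nn.Linear(1024, 1024)
optimizer = MARS(model.parameters(), lr=3e-3, betas=(0.95, 0.99), weight_decay=0.1)  # hypothetical signature

x = torch.randn(32, 1024)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()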
Multi-GPU training is launched via torchrun (the repository includes A100 examples, among others).
Highlighted Details
MARS supports both approximate (MARS-approx) and exact gradient calculations, with the former being faster but slightly less performant.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify a license, which may impact commercial adoption. The MARS-Lion and MARS-Shampoo instantiations require additional hyperparameter tuning. The "exact" MARS variant doubles the per-iteration computational cost.
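A plausible reading of that cost difference: the correction term needs a gradient at the previous iterate, which the exact variant recomputes on the current mini-batch (a second forward/backward pass each step), whereas MARS-approx reuses the gradient already computed at the previous step. A minimal sketch, with a toy quadratic loss standing in for one forward/backward pass and illustrative function names:

import numpy as np

def grad(params, batch):
    # Toy quadratic loss; stands in for one forward/backward pass.
    return params - batch.mean(axis=0)

def exact_step_grads(params, prev_params, batch):
    # Exact variant: two gradient evaluations on the SAME mini-batch per step.
    return grad(params, batch), grad(prev_params, batch)

def approx_step_grads(params, stored_prev_grad, batch):
    # Approximate variant: one evaluation; reuse the gradient stored last step.
    return grad(params, batch), stored_prev_grad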