MARS by AGI-Arena

Optimization framework for training large models

created 8 months ago
708 stars

Top 49.4% on sourcepulse

View on GitHub

Project Summary

MARS is a unified optimization framework designed to accelerate the training of large deep learning models by combining variance reduction techniques with preconditioned gradient methods. It targets researchers and engineers working with large language models and vision models, offering improved convergence and performance over standard optimizers like AdamW.

How It Works

MARS introduces a "scaled stochastic recursive momentum" to reduce gradient variance and a "preconditioned update" to approximate second-order methods. This dual approach aims to achieve better gradient complexity and per-iteration complexity, leading to faster convergence to critical points. It offers three instantiations: MARS-AdamW, MARS-Lion, and MARS-Shampoo, differing in their Hessian matrix approximations.
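
The sketch below illustrates the general shape of the MARS-AdamW instantiation for a single parameter tensor: a variance-reduced gradient correction, norm clipping, and AdamW-style moment estimates driven by the corrected gradient. This is a minimal sketch based on the paper's description; the function name, signature, and hyperparameter values are illustrative (not the repository's API), and bias correction is omitted.

    import torch

    def mars_adamw_step(param, grad, prev_grad, state,
                        lr=3e-3, beta1=0.95, beta2=0.99,
                        gamma=0.025, eps=1e-8, weight_decay=0.1):
        # Variance-reduced correction: add a scaled difference between the
        # current gradient and the gradient at the previous iterate.
        c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)

        # Clip the corrected gradient to unit norm to keep the correction stable.
        c_norm = torch.linalg.vector_norm(c)
        if c_norm > 1.0:
            c = c / c_norm

        # AdamW-style first and second moments, built on the corrected gradient.
        state["m"].mul_(beta1).add_(c, alpha=1.0 - beta1)
        state["v"].mul_(beta2).addcmul_(c, c, value=1.0 - beta2)

        # Decoupled weight decay followed by the preconditioned update.
        param.mul_(1.0 - lr * weight_decay)
        param.addcdiv_(state["m"], state["v"].sqrt().add_(eps), value=-lr)

In the approximate variant (MARS-approx), prev_grad is simply the stored gradient from the previous step, while the exact variant re-evaluates the previous iterate on the current batch, which is where the extra compute noted under Limitations comes from. MARS-Lion and MARS-Shampoo swap the AdamW-style second-moment preconditioning for Lion- and Shampoo-style updates, respectively.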

Quick Start & Requirements

  • Install: pip install torch==2.1.2 transformers==4.33.0 datasets tiktoken numpy==1.26.4 wandb
  • Data Prep: Follow nanoGPT instructions for OpenWebText.
  • Training: torchrun --standalone --nproc_per_node=8 MARS/train_mars.py config/${your_config_file} (a hypothetical config sketch follows this list)
  • Prerequisites: PyTorch 2.1.2, Transformers 4.33.0, CUDA-enabled GPU (implied by torchrun and the A100 examples).
  • Resources: Training GPT-2 small on an A100 uses a batch size of 15. Reproducing the reported results involves significant compute, with runs of up to 50B tokens on datasets like OpenWebText and FineWeb-Edu.
  • Docs: Configuration files, Scripts
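
As referenced in the Training bullet above, the following is a hypothetical nanoGPT-style config sketch. Because MARS builds on nanoGPT, config files are plain Python variable assignments; the file name and field names here follow nanoGPT conventions and are assumptions, so consult the files in the repository's config/ directory for the actual options and recommended hyperparameters.

    # config/mars_gpt2_small.py  (hypothetical file name)
    # Field names follow nanoGPT conventions and are assumptions; the real
    # config files in the repository may expose different or additional options.

    out_dir = "out-gpt2-small-mars"
    dataset = "openwebtext"            # prepared via the nanoGPT data scripts

    batch_size = 15                    # per-GPU batch size reported for A100s
    block_size = 1024
    gradient_accumulation_steps = 4

    learning_rate = 3e-3               # illustrative value; tune per instantiation
    weight_decay = 0.1
    warmup_iters = 2000
    max_iters = 100000

    wandb_log = True
    wandb_project = "mars-gpt2"

A file like this would be launched with the command above, e.g. torchrun --standalone --nproc_per_node=8 MARS/train_mars.py config/mars_gpt2_small.py. For data prep, nanoGPT's OpenWebText instructions boil down to running its prepare script (python data/openwebtext/prepare.py in nanoGPT; the corresponding path in this repository may differ).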

Highlighted Details

  • Achieves better test loss and accuracy than AdamW and Muon on CIFAR-10/100 with ResNet-18.
  • GPT-2 XL trained on 50B FineWeb-Edu tokens reaches 56.52% HellaSwag accuracy.
  • Outperforms AdamW and Muon on GPT-2 models across various dataset sizes (5B, 20B, 50B tokens).
  • Offers both approximate (MARS-approx) and exact gradient calculations, with the former being faster but slightly less performant.

Maintenance & Community

  • Project is actively updated, with recent additions including vision tasks and reproduction scripts.
  • Built upon nanoGPT, Levanter, and Sophia.
  • Paper available on arXiv: https://arxiv.org/abs/2411.10438

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which may impact commercial adoption. Hyperparameters require tuning for the MARS-Lion and MARS-Shampoo instantiations. The "exact" MARS variant roughly doubles per-step compute, since it evaluates gradients at both the current and previous iterates on the same batch.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 80 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 1 more.

Sophia by Liuhong99

965 stars
Optimizer for language model pre-training (research paper)
created 2 years ago, updated 1 year ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 3 more.

modded-nanogpt by KellerJordan

3k stars
Language model training speedrun on 8x H100 GPUs
created 1 year ago, updated 2 weeks ago