MARS by AGI-Arena

Optimization framework for training large models

Created 10 months ago
703 stars

Top 48.5% on SourcePulse

View on GitHub
Project Summary

MARS is a unified optimization framework designed to accelerate the training of large deep learning models by combining variance reduction techniques with preconditioned gradient methods. It targets researchers and engineers working with large language models and vision models, offering improved convergence and performance over standard optimizers like AdamW.

How It Works

MARS introduces a "scaled stochastic recursive momentum" to reduce gradient variance and a "preconditioned update" to approximate second-order methods. This dual approach aims to achieve better gradient complexity and per-iteration complexity, leading to faster convergence to critical points. It offers three instantiations: MARS-AdamW, MARS-Lion, and MARS-Shampoo, differing in their Hessian matrix approximations.
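
To make the update concrete, below is a minimal, self-contained sketch of a MARS-AdamW-style step for a single parameter tensor. It is written from the description above and the paper's stated update rule, not taken from the repository; the function name, state layout, and default hyperparameters are illustrative only.

    import torch

    def mars_adamw_step(param, grad, prev_grad, state, lr=3e-3,
                        beta1=0.95, beta2=0.99, gamma=0.025,
                        weight_decay=0.0, eps=1e-8):
        # Variance-reduced estimate ("scaled stochastic recursive momentum"):
        # current gradient plus a scaled correction from the previous gradient.
        c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)
        # Clip the estimate to unit norm so the correction term stays bounded.
        norm = c.norm()
        if norm > 1.0:
            c = c / norm
        # AdamW-style first/second moments act as the diagonal preconditioner.
        state["t"] += 1
        state["m"].mul_(beta1).add_(c, alpha=1.0 - beta1)
        state["v"].mul_(beta2).addcmul_(c, c, value=1.0 - beta2)
        m_hat = state["m"] / (1.0 - beta1 ** state["t"])
        v_hat = state["v"] / (1.0 - beta2 ** state["t"])
        # Decoupled weight decay, then the preconditioned update.
        param.mul_(1.0 - lr * weight_decay)
        param.add_(-lr * m_hat / (v_hat.sqrt() + eps))

    # Toy usage: passing last step's gradient as `prev_grad` corresponds to the
    # cheaper approximate variant; the exact variant would instead re-evaluate
    # the previous parameters on the current minibatch, roughly doubling cost.
    w = torch.zeros(8)
    state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w), "t": 0}
    prev_g = torch.zeros_like(w)
    for _ in range(3):
        g = torch.randn_like(w)  # stand-in for a minibatch gradient
        mars_adamw_step(w, g, prev_g, state)
        prev_g = g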

Quick Start & Requirements

  • Install: pip install torch==2.1.2 transformers==4.33.0 datasets tiktoken numpy==1.26.4 wandb
  • Data Prep: Follow nanoGPT instructions for OpenWebText.
  • Training: torchrun --standalone --nproc_per_node=8 MARS/train_mars.py config/${your_config_file} (a hypothetical config sketch follows this list)
  • Prerequisites: PyTorch 2.1.2, Transformers 4.33.0, CUDA-enabled GPU (implied by torchrun and A100 examples).
  • Resources: Training GPT-2 small on an A100 requires a batch size of 15. Reproducing results involves significant compute for large datasets like OpenWebText (50B tokens).
  • Docs: Configuration files, Scripts
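
Since the project builds on nanoGPT, the files under config/ are presumably plain-Python configs whose variables override the training script's defaults. The sketch below is hypothetical and only illustrates that style; the filename, variable names, and values follow nanoGPT conventions and are not taken from this repository.

    # config/train_gpt2_small_mars.py -- hypothetical filename; variables follow
    # nanoGPT's plain-Python config convention and may differ from the repo's files.
    wandb_log = True
    wandb_project = 'mars-owt'
    dataset = 'openwebtext'
    batch_size = 15                    # the A100 batch size noted above (assumed per GPU)
    block_size = 1024
    gradient_accumulation_steps = 32   # illustrative; tune for the target effective batch
    learning_rate = 6e-4               # illustrative default, not a tuned MARS value
    max_iters = 100000
    weight_decay = 1e-1

    # Launch (same command as above, with the hypothetical config):
    # torchrun --standalone --nproc_per_node=8 MARS/train_mars.py config/train_gpt2_small_mars.py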

Highlighted Details

  • Achieves better test loss and accuracy than AdamW and Muon on CIFAR-10/100 with ResNet-18.
  • GPT-2 XL trained on FineWeb-Edu reaches a HellaSwag accuracy of 56.52 after 50B tokens.
  • Outperforms AdamW and Muon on GPT-2 models across various dataset sizes (5B, 20B, 50B tokens).
  • Offers both approximate (MARS-approx) and exact gradient calculations, with the former being faster but slightly less performant.

Maintenance & Community

  • Project is actively updated, with recent additions including vision tasks and reproduction scripts.
  • Built upon nanoGPT, Levanter, and Sophia.
  • Paper available on arXiv: https://arxiv.org/abs/2411.10438

Licensing & Compatibility

  • No explicit license is mentioned in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which may impact commercial adoption. Hyperparameters require tuning for the MARS-Lion and MARS-Shampoo instantiations. The "exact" MARS variant doubles computational cost because it evaluates the gradient at both the current and previous iterates on each minibatch; MARS-approx avoids this by reusing the previous step's gradient.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 4 more.

Explore Similar Projects

Sophia by Liuhong99
Optimizer for language model pre-training (research paper)

  • 970 stars (0.1%), created 2 years ago, updated 1 year ago
  • Starred by Benjamin Bolte (Cofounder of K-Scale Labs), Albert Gu (Cofounder of Cartesia; Professor at CMU), and 2 more.

Muon by KellerJordan
Optimizer for neural network hidden layers

  • 2k stars (1.7%), created 10 months ago, updated 2 months ago
  • Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Casper Hansen (Author of AutoAWQ), and 1 more.

GPT2 by ConnorJL
GPT2 training implementation, supporting TPUs and GPUs

  • 1k stars (0%), created 6 years ago, updated 2 years ago
  • Starred by Tobi Lutke (Cofounder of Shopify), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 26 more.

ColossalAI by hpcaitech
AI system for large-scale parallel training

  • 41k stars (0.1%), created 3 years ago, updated 16 hours ago