Memory-efficient training for large language models via gradient low-rank projection
GaLore offers a memory-efficient approach to training large language models (LLMs) by employing gradient low-rank projection. It enables full-parameter learning with a reduced memory footprint compared to methods like LoRA, making it suitable for researchers and practitioners who want to train larger models on limited hardware.
How It Works
GaLore projects gradients into a low-rank subspace, significantly reducing memory usage during training. The method is optimizer-agnostic and integrates into existing optimizers with minimal code changes. It also supports per-layer weight updates via PyTorch's register_post_accumulate_grad_hook, further reducing the memory held for weight gradients.
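The mechanism can be illustrated with a minimal, self-contained sketch. This is not the library's code: the names, shapes, hyperparameter values, and SVD-based projector refresh below are illustrative assumptions. The idea is that the gradient of a weight matrix is projected into a low-rank subspace, the update is formed there, and it is projected back before being applied to the full weight matrix.

```python
import torch

def make_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # Build a projector from the top-`rank` left singular vectors of the gradient.
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                          # shape: (m, rank)

# Toy training-loop fragment (hypothetical names, not GaLore's API).
W = torch.randn(256, 256, requires_grad=True)
rank, update_proj_gap, lr = 32, 50, 0.01
P = None

for step in range(200):
    loss = (W ** 2).sum()                       # placeholder loss
    loss.backward()
    with torch.no_grad():
        if step % update_proj_gap == 0:
            P = make_projector(W.grad, rank)    # refresh the subspace periodically
        low_rank_grad = P.T @ W.grad            # (rank, n): optimizer state would live here
        update = P @ low_rank_grad              # project the update back to (m, n)
        W -= lr * update
        W.grad = None
```

In GaLore itself, the optimizer statistics (e.g., Adam moments) are maintained in the low-rank space, which is where the memory savings come from; the sketch above only shows the projection bookkeeping.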
Quick Start & Requirements
- Install via pip install galore-torch, or install from source.
- The experiment scripts additionally require pip install -r exp_requirements.txt.
- Train with the GaLoreAdamW, GaLoreAdamW8bit, or GaLoreAdafactor optimizers (usage sketch below).
- Per-layer weight updates require registering gradient hooks; the pretraining benchmarks expose this mode as the galore_adamw8bit_per_layer optimizer option (hook sketch below).
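A minimal usage sketch following the param-group pattern documented in the GaLore repository. The toy model and the parameter split are illustrative assumptions; the group keys rank, update_proj_gap, scale, and proj_type mirror the README's example values and may differ across versions.

```python
import torch
from galore_torch import GaLoreAdamW  # GaLoreAdamW8bit / GaLoreAdafactor follow the same pattern

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)

# Apply GaLore to 2D weight matrices; other parameters (biases, norms) are optimized normally.
galore_params = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```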
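For per-layer weight updates, the repository's documented pattern registers one optimizer per parameter and steps it from a post-accumulate-grad hook, so each weight gradient can be released as soon as it is consumed. A sketch of that pattern, assuming PyTorch >= 2.1 and reusing the model and hyperparameters from the previous example (the p.dim() == 2 split is an assumption for which parameters get GaLore groups):

```python
from galore_torch import GaLoreAdamW8bit

# One optimizer per trainable parameter, stepped as soon as that parameter's gradient is ready.
optimizer_dict = {}
for p in model.parameters():
    if not p.requires_grad:
        continue
    if p.dim() == 2:
        group = {"params": [p], "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}
    else:
        group = {"params": [p]}
    optimizer_dict[p] = GaLoreAdamW8bit([group], lr=0.01)

def optimizer_hook(p):
    # Called right after p.grad is accumulated during the backward pass.
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# register_post_accumulate_grad_hook requires PyTorch >= 2.1.
for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```

With this setup, explicit optimizer.step() and zero_grad() calls in the training loop are no longer needed for these parameters, since each update happens inside the hook.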
Highlighted Details
- An 8-bit optimizer variant (GaLoreAdamW8bit) is available for additional optimizer-state memory savings.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Per-layer weight updates are currently limited to single-GPU training without DistributedDataParallel. Multi-GPU support for this feature is under development.