GaLore by jiaweizzhao

Memory-efficient training for large language models via gradient low-rank projection

Created 1 year ago
1,606 stars

Top 26.1% on SourcePulse

Project Summary

GaLore offers a memory-efficient approach to training large language models (LLMs) by employing gradient low-rank projection. Unlike low-rank adaptation methods such as LoRA, it enables full-parameter learning while keeping a comparable memory footprint, making it suitable for researchers and practitioners who want to train larger models on limited hardware.

How It Works

GaLore projects gradients into a low-rank subspace, so optimizer states are stored at the reduced rank and training memory drops significantly. The method is optimizer-agnostic and can be dropped into existing training loops with minimal code changes. It also supports per-layer weight updates via PyTorch's register_post_accumulate_grad_hook, which further reduces the memory needed for weight gradients.
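For intuition, here is a minimal sketch of what a GaLore-style optimizer step looks like, assuming a fixed projection matrix P (GaLore obtains it from an SVD of the gradient and refreshes it periodically). This is an illustration of the technique, not the library's actual implementation, and it omits bias correction and weight decay.

```python
import torch

def galore_style_adam_step(weight, grad, exp_avg, exp_avg_sq, P,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative Adam-like step with optimizer state kept in a low-rank subspace."""
    # Project the full gradient (m x n) into the low-rank subspace (r x n); P is (m, r).
    low_rank_grad = P.T @ grad
    # The Adam moment estimates are stored at size r x n instead of m x n,
    # which is where the optimizer-state memory savings come from.
    exp_avg.mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(low_rank_grad, low_rank_grad, value=1 - beta2)
    update = exp_avg / (exp_avg_sq.sqrt() + eps)
    # Project the update back to full rank before applying it to the weight.
    weight.data.add_(P @ update, alpha=-lr)
```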

Quick Start & Requirements

  • Installation: pip install galore-torch or install from source.
  • Dependencies: PyTorch 2.1.0+ for per-layer weight updates; install experiment dependencies with pip install -r exp_requirements.txt.
  • Usage: Integrate the GaLoreAdamW, GaLoreAdamW8bit, or GaLoreAdafactor optimizers; per-layer updates additionally require registering gradient hooks (see the sketch after this list).
  • Resources: Benchmarks show training LLaMA-7B on a single RTX 4090 (24GB) with activation checkpointing and galore_adamw8bit_per_layer.
  • Docs: Official Docs
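As a rough sketch of the usage above, the snippet below builds two parameter groups and hands the 2D weight matrices to GaLoreAdamW with low-rank projection settings. The group keys (rank, update_proj_gap, scale, proj_type) and the values mirror examples from the upstream README; the `model` variable and the rule for selecting GaLore parameters are assumptions for illustration.

```python
from galore_torch import GaLoreAdamW

# Assumed: `model` is an existing torch.nn.Module (e.g. a Hugging Face LLaMA model).
# GaLore projection is typically applied only to the 2D attention/MLP weight matrices.
galore_params = [p for n, p in model.named_parameters()
                 if p.requires_grad and p.dim() == 2 and "embed" not in n]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters()
                  if p.requires_grad and id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    # Hyperparameter values are illustrative, following the README's examples.
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```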

Highlighted Details

  • Achieves memory savings comparable to LoRA while allowing full-parameter updates.
  • Supports 8-bit quantization of optimizer states via GaLoreAdamW8bit (a per-layer sketch follows this list).
  • Demonstrated effectiveness in pre-training LLaMA and fine-tuning RoBERTa.
  • Q-GaLore (Quantized GaLore with INT4 Projection) is available.
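The per-layer 8-bit recipe mentioned above can be wired up roughly as follows: one small GaLoreAdamW8bit instance per 2D weight, stepped from a gradient hook so each weight's gradient can be released right after it is applied. This is a simplified sketch based on the README's description; `model`, the hyperparameter values, and the restriction to 2D weights are assumptions, and the remaining parameters would keep an ordinary optimizer (omitted here).

```python
from galore_torch import GaLoreAdamW8bit

# Assumed: `model` is an existing torch.nn.Module.
# One small optimizer per 2D weight matrix, stepped during the backward pass.
optimizer_dict = {}
for p in model.parameters():
    if p.requires_grad and p.dim() == 2:
        optimizer_dict[p] = GaLoreAdamW8bit(
            [{"params": [p], "rank": 128, "update_proj_gap": 200,
              "scale": 0.25, "proj_type": "std"}],
            lr=0.01,
        )

def optimizer_hook(p):
    # Fires right after p.grad has been accumulated; step, then free the gradient.
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# register_post_accumulate_grad_hook requires PyTorch >= 2.1.0.
for p in optimizer_dict:
    p.register_post_accumulate_grad_hook(optimizer_hook)
```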

Maintenance & Community

  • Active development with GaLore 2 in progress.
  • Community discussion via Slack.
  • Paper accepted to ICML 2024 (Oral).

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Per-layer weight updates are currently limited to single-GPU training without DistributedDataParallel. Multi-GPU support for this feature is under development.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days
