GaLore by jiaweizzhao

Memory-efficient training for large language models via gradient low-rank projection

Created 1 year ago
1,606 stars

Top 26.1% on SourcePulse

Project Summary

GaLore offers a memory-efficient approach to training large language models (LLMs) by employing gradient low-rank projection. Unlike low-rank adaptation methods such as LoRA, it enables full-parameter learning while keeping a comparable memory footprint, making it suitable for researchers and practitioners who want to train larger models on limited hardware.

How It Works

GaLore projects gradients into a low-rank subspace, so optimizer states are stored at the reduced rank and training memory drops significantly. The method is optimizer-agnostic and can be dropped into existing training loops with minimal code changes. It also supports per-layer weight updates via PyTorch's register_post_accumulate_grad_hook, which further reduces the memory needed for weight gradients.
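For intuition, here is a minimal sketch of what a GaLore-style optimizer step looks like, assuming a fixed projection matrix P (GaLore obtains it from an SVD of the gradient and refreshes it periodically). This is an illustration of the technique, not the library's actual implementation, and it omits bias correction and weight decay.

```python
import torch

def galore_style_adam_step(weight, grad, exp_avg, exp_avg_sq, P,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative Adam-like step with optimizer state kept in a low-rank subspace."""
    # Project the full gradient (m x n) into the low-rank subspace (r x n); P is (m, r).
    low_rank_grad = P.T @ grad
    # The Adam moment estimates are stored at size r x n instead of m x n,
    # which is where the optimizer-state memory savings come from.
    exp_avg.mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(low_rank_grad, low_rank_grad, value=1 - beta2)
    update = exp_avg / (exp_avg_sq.sqrt() + eps)
    # Project the update back to full rank before applying it to the weight.
    weight.data.add_(P @ update, alpha=-lr)
```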

Quick Start & Requirements

  • Installation: pip install galore-torch or install from source.
  • Dependencies: PyTorch 2.1.0+ for per-layer weight updates; install experiment dependencies with pip install -r exp_requirements.txt.
  • Usage: Integrate the GaLoreAdamW, GaLoreAdamW8bit, or GaLoreAdafactor optimizers; per-layer updates additionally require registering gradient hooks (see the sketch after this list).
  • Resources: Benchmarks show training LLaMA-7B on a single RTX 4090 (24GB) with activation checkpointing and galore_adamw8bit_per_layer.
  • Docs: Official Docs
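As a rough sketch of the usage above, the snippet below builds two parameter groups and hands the 2D weight matrices to GaLoreAdamW with low-rank projection settings. The group keys (rank, update_proj_gap, scale, proj_type) and the values mirror examples from the upstream README; the `model` variable and the rule for selecting GaLore parameters are assumptions for illustration.

```python
from galore_torch import GaLoreAdamW

# Assumed: `model` is an existing torch.nn.Module (e.g. a Hugging Face LLaMA model).
# GaLore projection is typically applied only to the 2D attention/MLP weight matrices.
galore_params = [p for n, p in model.named_parameters()
                 if p.requires_grad and p.dim() == 2 and "embed" not in n]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters()
                  if p.requires_grad and id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    # Hyperparameter values are illustrative, following the README's examples.
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=0.01)
```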

Highlighted Details

  • Achieves memory savings comparable to LoRA while allowing full-parameter updates.
  • Supports 8-bit quantization of optimizer states via GaLoreAdamW8bit (a per-layer sketch follows this list).
  • Demonstrated effectiveness in pre-training LLaMA and fine-tuning RoBERTa.
  • Q-GaLore (Quantized GaLore with INT4 Projection) is available.
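The per-layer 8-bit recipe mentioned above can be wired up roughly as follows: one small GaLoreAdamW8bit instance per 2D weight, stepped from a gradient hook so each weight's gradient can be released right after it is applied. This is a simplified sketch based on the README's description; `model`, the hyperparameter values, and the restriction to 2D weights are assumptions, and the remaining parameters would keep an ordinary optimizer (omitted here).

```python
from galore_torch import GaLoreAdamW8bit

# Assumed: `model` is an existing torch.nn.Module.
# One small optimizer per 2D weight matrix, stepped during the backward pass.
optimizer_dict = {}
for p in model.parameters():
    if p.requires_grad and p.dim() == 2:
        optimizer_dict[p] = GaLoreAdamW8bit(
            [{"params": [p], "rank": 128, "update_proj_gap": 200,
              "scale": 0.25, "proj_type": "std"}],
            lr=0.01,
        )

def optimizer_hook(p):
    # Fires right after p.grad has been accumulated; step, then free the gradient.
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

# register_post_accumulate_grad_hook requires PyTorch >= 2.1.0.
for p in optimizer_dict:
    p.register_post_accumulate_grad_hook(optimizer_hook)
```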

Maintenance & Community

  • Active development with GaLore 2 in progress.
  • Community discussion via Slack.
  • Paper accepted to ICML 2024 (Oral).

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Per-layer weight updates are currently limited to single-GPU training without DistributedDataParallel. Multi-GPU support for this feature is under development.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days
