GaLore by jiaweizzhao

Memory-efficient training for large language models via gradient low-rank projection

created 1 year ago
1,578 stars

Top 27.0% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

GaLore offers a memory-efficient approach to training large language models (LLMs) by employing gradient low-rank projection. It enables full-parameter learning with reduced memory footprint compared to methods like LoRA, making it suitable for researchers and practitioners aiming to train larger models on limited hardware.

How It Works

GaLore projects gradients into a low-rank subspace, which shrinks the optimizer states (e.g., Adam's moment estimates) and thus significantly reduces memory usage during training. The method is optimizer-agnostic and drops into existing optimizers with minimal code changes. It also supports per-layer weight updates via PyTorch's register_post_accumulate_grad_hook, letting each weight gradient be applied and released during backpropagation to further cut gradient memory.
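
To make the mechanism concrete, the following is a conceptual sketch of the projection step, not GaLore's actual implementation: it builds a rank-r projector from the gradient's top singular vectors and shows where the low-rank optimizer state would live. The dimensions, rank, and learning rate are placeholders.

    # Conceptual sketch of gradient low-rank projection (illustrative only,
    # not GaLore's code). A full-rank gradient G of shape (m, n) is projected
    # onto a rank-r subspace, the optimizer state lives in that small
    # subspace, and the resulting update is projected back.
    import torch

    m, n, rank = 4096, 4096, 128          # placeholder dimensions
    weight = torch.randn(m, n)
    grad = torch.randn(m, n)              # stand-in for weight.grad

    # GaLore refreshes the projector periodically (every `update_proj_gap`
    # steps); here we just take the top-r left singular vectors once.
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                       # (m, r) orthonormal basis

    low_rank_grad = P.T @ grad            # (r, n): what the optimizer sees,
                                          # so Adam moments are (r, n), not (m, n)
    update = P @ low_rank_grad            # project the update back to (m, n)
    weight.add_(update, alpha=-1e-3)      # plain SGD-style step for illustration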

Quick Start & Requirements

  • Installation: pip install galore-torch or install from source.
  • Dependencies: PyTorch 2.1.0+ for per-layer weight updates. Experiment dependencies require pip install -r exp_requirements.txt.
  • Usage: Integrate the GaLoreAdamW, GaLoreAdamW8bit, or GaLoreAdafactor optimizers; per-layer updates require registering gradient hooks (see the sketch after this list).
  • Resources: Benchmarks show training LLaMA-7B on a single RTX 4090 (24GB) with activation checkpointing and galore_adamw8bit_per_layer.
  • Docs: Official Docs
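
Below is a minimal sketch of plugging GaLoreAdamW into a training loop via optimizer param groups, following the pattern described in the project README; the toy model and the hyperparameters (rank, update_proj_gap, scale, proj_type) are placeholder values to adapt to your own setup.

    # Minimal sketch: GaLoreAdamW via optimizer param groups (hyperparameters
    # are placeholders, not recommended settings).
    import torch
    from galore_torch import GaLoreAdamW

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024),
                                torch.nn.Linear(1024, 1024))

    # Put the 2D weight matrices into a dedicated GaLore group; leave biases
    # and other parameters in a regular group.
    galore_params = [p for p in model.parameters() if p.dim() == 2]
    galore_ids = {id(p) for p in galore_params}
    regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

    param_groups = [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ]
    optimizer = GaLoreAdamW(param_groups, lr=0.01)

    # The training step is unchanged from any other torch optimizer.
    loss = model(torch.randn(8, 1024)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()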

Highlighted Details

  • Achieves memory savings comparable to LoRA while allowing full-parameter updates.
  • Supports 8-bit quantization of optimizer states via GaLoreAdamW8bit (a per-layer usage sketch follows this list).
  • Demonstrated effectiveness in pre-training LLaMA and fine-tuning RoBERTa.
  • Q-GaLore (Quantized GaLore with INT4 Projection) is available.
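
The following is a hedged sketch of per-layer weight updates with GaLoreAdamW8bit using PyTorch's register_post_accumulate_grad_hook (PyTorch 2.1.0+; the 8-bit optimizer additionally relies on bitsandbytes). It mirrors the pattern the README describes, but the model and hyperparameters here are placeholders.

    # Sketch of per-layer weight updates with GaLoreAdamW8bit (requires
    # PyTorch >= 2.1.0; hyperparameters are placeholders).
    import torch
    from galore_torch import GaLoreAdamW8bit

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024),
                                torch.nn.Linear(1024, 1024))

    # One small optimizer per parameter, so each gradient can be consumed
    # and released as soon as it has been accumulated.
    optimizer_dict = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if p.dim() == 2:   # project only 2D weight matrices
            group = {"params": [p], "rank": 128, "update_proj_gap": 200,
                     "scale": 0.25, "proj_type": "std"}
        else:
            group = {"params": [p]}
        optimizer_dict[p] = GaLoreAdamW8bit([group], lr=0.01)

    def optimizer_hook(p):
        # Runs right after p.grad is accumulated: step, then drop the gradient.
        optimizer_dict[p].step()
        optimizer_dict[p].zero_grad()

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(optimizer_hook)

    # The training loop then only calls backward(); updates happen in the hooks.
    loss = model(torch.randn(8, 1024)).pow(2).mean()
    loss.backward()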

Maintenance & Community

  • Active development with GaLore 2 in progress.
  • Community discussion via Slack.
  • Paper accepted to ICML 2024 (Oral).

Licensing & Compatibility

  • License details are not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Per-layer weight updates are currently limited to single-GPU training without DistributedDataParallel. Multi-GPU support for this feature is under development.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 40 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding
Framework for accelerating LLM generation using multiple decoding heads
0.2% · 3k stars · created 1 year ago · updated 1 year ago

Starred by Lewis Tunstall (Researcher at Hugging Face), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 5 more.

torchtune by pytorch
PyTorch library for LLM post-training and experimentation
0.2% · 5k stars · created 1 year ago · updated 1 day ago