APOLLO by zhuhanqing

Memory-efficient optimizer for LLM training

Created 10 months ago
257 stars

Top 98.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

APOLLO is a memory-efficient optimizer for large language model (LLM) pre-training and fine-tuning, targeting researchers and practitioners facing memory constraints. It achieves SGD-like memory costs while maintaining AdamW-level performance by approximating gradient scaling factors using low-rank auxiliary spaces and random projections, avoiding costly SVD operations.
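
To make the memory claim concrete, here is a back-of-the-envelope comparison for a single weight matrix. The shapes and the rank are assumptions chosen for illustration, not measured numbers from the project.

```python
# Rough optimizer-state memory for one 4096 x 11008 weight matrix (a LLaMA-7B MLP
# projection), assuming fp32 state and an APOLLO rank of 256; purely illustrative.
n, m, r = 4096, 11008, 256

adamw_state  = 2 * n * m            # element-wise first + second moments
apollo_state = 2 * r * m + r * n    # moments in the r x m auxiliary space + random projection
                                    # (the projection can also be regenerated from a seed)

print(f"AdamW : {adamw_state  * 4 / 2**20:6.1f} MiB")   # ~344 MiB
print(f"APOLLO: {apollo_state * 4 / 2**20:6.1f} MiB")   # ~25.5 MiB
```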

How It Works

APOLLO leverages two key ideas: structured learning-rate updates and reduction of optimizer-state redundancy. It observes that channel-wise or tensor-wise gradient scaling is sufficient for LLM training, exploiting the redundancy in AdamW's element-wise learning rates. APOLLO approximates these scaling factors in a low-rank auxiliary space built from random projections, which avoids costly SVD operations and yields significant memory savings. APOLLO-Mini goes further, using rank-1 tensor-wise scaling to reach SGD-level memory cost while still outperforming Adam(W). A minimal sketch of the idea follows.
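
The mechanism can be paraphrased in a few lines of PyTorch. This is an illustrative sketch, not the library's implementation: the function name, the fixed random projection P, and the defaults are assumptions for exposition, and the real optimizer also handles projection refresh, weight decay, and integration with standard training loops.

```python
import torch

def apollo_channelwise_step(G, P, M, V, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Illustrative APOLLO-style step for one weight matrix (not the library code).

    G: full gradient, shape (n, m)
    P: fixed random projection, shape (r, n) with r << n (the auxiliary space)
    M, V: Adam moments kept only in the compressed (r, m) space
    Returns an update with the same shape as G.
    """
    R = P @ G                                            # compress the gradient; no SVD involved
    M.mul_(betas[0]).add_(R, alpha=1 - betas[0])         # first moment in the auxiliary space
    V.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])  # second moment in the auxiliary space
    m_hat = M / (1 - betas[0] ** step)
    v_hat = V / (1 - betas[1] ** step)
    R_adam = m_hat / (v_hat.sqrt() + eps)                # what AdamW would do, but in rank-r space

    # Channel-wise scaling factor: how much Adam would rescale each column,
    # estimated from the low-rank statistics and applied to the raw gradient.
    scale = R_adam.norm(dim=0) / (R.norm(dim=0) + eps)   # shape (m,)
    # APOLLO-Mini instead uses a single tensor-wise factor with r = 1:
    # scale = R_adam.norm() / (R.norm() + eps)
    return -lr * G * scale
```

The only persistent per-matrix state is M and V (each r x m) plus the projection, which is why the footprint approaches SGD's as the rank shrinks toward 1.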

Quick Start & Requirements

  • Install via pip: pip install apollo-torch
  • Install from source: git clone https://github.com/zhuhanqing/APOLLO.git && cd APOLLO && pip install -e .
  • Experiment dependencies: pip install -r exp_requirements.txt
  • Requires PyTorch.
  • Official documentation and a Hugging Face Transformers integration are available; a usage sketch follows this list.
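
For orientation, here is a minimal training-setup sketch. The class name APOLLOAdamW and the parameter-group keys follow the upstream README at the time of writing and may differ between versions; treat them as illustrative rather than a stable API.

```python
import torch
from apollo_torch import APOLLOAdamW  # name as documented upstream; may change

# Stand-in for an LLM block.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008),
    torch.nn.Linear(11008, 4096),
)

# Matrix-shaped weights get the low-rank APOLLO state; everything else stays regular.
lowrank_params = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {"params": regular_params},
    {
        "params": lowrank_params,
        "rank": 256,               # auxiliary-space rank (1 for APOLLO-Mini)
        "proj": "random",          # random projection instead of SVD
        "scale_type": "channel",   # "tensor" for APOLLO-Mini
        "update_proj_gap": 200,    # assumed key: how often the projection is refreshed
    },
]
optimizer = APOLLOAdamW(param_groups, lr=1e-2)

loss = model(torch.randn(8, 4096)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```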

Highlighted Details

  • Achieves up to 3x higher training throughput on A100-80GB GPUs by enabling 4x larger batch sizes.
  • Enables LLaMA-13B pre-training on A100-80GB GPUs with naive DDP.
  • Allows LLaMA-7B training from scratch in under 12GB of memory when combined with quantization.
  • Validated by a third-party Julia implementation and integrated into LLaMA-Factory and Hugging Face Transformers.

Maintenance & Community

  • Active development with recent integrations into major frameworks.
  • Core contributors' contact information is provided for inquiries.
  • Paper accepted to MLSys'25 with an outstanding paper honorable mention.

Licensing & Compatibility

  • The majority of the project is licensed under CC-BY-NC.
  • GaLore components are under the Apache 2.0 license.
  • CC-BY-NC may restrict commercial use or linking with closed-source projects.

Limitations & Caveats

The primary license (CC-BY-NC) may restrict commercial applications. The project's to-do list still includes open items such as FSDP support.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Elvis Saravia (Founder of DAIR.AI), and 2 more.

Explore Similar Projects

YaFSDP by yandex

Sharded data parallelism framework for transformer-like neural networks
Top 0.1% on SourcePulse · 979 stars
Created 1 year ago · Updated 3 weeks ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 39 more.

unsloth by unslothai

Finetuning tool for LLMs, targeting speed and memory efficiency
Top 0.5% on SourcePulse · 47k stars
Created 1 year ago · Updated 9 hours ago