BAdam by Ledzy

Memory-efficient optimizer for large language model finetuning

created 1 year ago
265 stars

Top 96.5% on SourcePulse

View on GitHub
Project Summary

BAdam offers a memory-efficient alternative to full-parameter fine-tuning of large language models by applying Adam's update rule to small parameter blocks sequentially. This approach significantly reduces memory requirements, enabling fine-tuning of models like Llama 3-8B on a single RTX3090, while achieving competitive or superior performance compared to LoRA.

How It Works

BAdam implements block coordinate optimization by iterating through partitions of the model's parameters (e.g., individual transformer layers). For a specified number of updates (switch_block_every), it applies the Adam optimizer only to the currently active block, keeping all other parameters frozen. This sequential block processing drastically lowers peak memory usage for optimizer states and gradients. The library provides flexibility in defining these blocks, from entire layers to specific matrix modules, and supports model parallelism via DeepSpeed ZeRO-3 for distributed training.
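
The following is a minimal, self-contained sketch of the block-coordinate idea in plain PyTorch. It is an illustration only, not the library's API: the toy model, loss, and data are placeholders, and only the hyperparameter name switch_block_every is taken from the description above.

    import torch
    from torch import nn

    # Illustration of sequential block-wise Adam: partition parameters into
    # blocks (here, one block per layer) and let only the active block train.
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])  # stand-in for transformer layers
    blocks = [list(layer.parameters()) for layer in model]
    switch_block_every = 50  # Adam steps spent on each block before switching

    for block in blocks:
        # Freeze everything, then unfreeze the active block. Adam states and
        # gradients exist only for this block, which is where the memory saving comes from.
        for p in model.parameters():
            p.requires_grad_(False)
        for p in block:
            p.requires_grad_(True)

        optimizer = torch.optim.Adam(block, lr=1e-5)
        for _ in range(switch_block_every):
            batch = torch.randn(4, 256)        # placeholder data
            loss = model(batch).pow(2).mean()  # placeholder objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()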

Quick Start & Requirements

  • Install via pip: pip install badam
  • Build from source: git clone https://github.com/Ledzy/BAdam.git && cd BAdam && pip install -e .
  • For reproducing paper results: conda create -n badam python=3.10 && conda activate badam && pip install -r requirements.txt
  • Requires PyTorch and mixed-precision training (e.g., bf16=True); tested on NVIDIA GPUs (RTX 3090).
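
The snippet below sketches how BAdam is typically dropped into an existing training setup by wrapping the base optimizer. The import path and all argument names other than switch_block_every (base_optimizer, named_parameters_list, switch_mode) are assumptions to verify against the README, and the two-layer model is a placeholder for a real transformer.

    import torch
    from torch import nn
    from badam import BlockOptimizer  # assumed import path; confirm against the README

    model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))  # placeholder for a real LLM

    # Wrap a standard optimizer so that only one parameter block is updated at a time.
    # Argument names are assumed from the project docs and may differ across versions.
    base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    optimizer = BlockOptimizer(
        base_optimizer=base_optimizer,
        named_parameters_list=list(model.named_parameters()),
        switch_block_every=100,  # Adam steps before the next block becomes active
        switch_mode="random",    # order in which blocks are visited (assumed option)
    )
    # optimizer.step() and optimizer.zero_grad() are then called as usual
    # inside the existing training loop.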

Highlighted Details

  • Fine-tunes Llama 2-7B in 21.8 GB of memory and Llama 3-8B in 23.5 GB, compared to 122.8 GB+ and 144 GB+ for full-parameter Adam.
  • Outperforms LoRA on MT-bench scores (e.g., 6.67 for BAdam vs. 6.41 for LoRA on Llama 3-8B).
  • Supports custom block partitioning (e.g., by module, by parameter ratio) and DeepSpeed ZeRO-3 for model parallelism.
  • Adaptive switch_block_every hyperparameter suggestion: min(max(n/(BD), 50), 100).
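
As a worked example of the clamp in the last bullet, the hypothetical helper below computes the suggested value, under the assumed reading that n is the number of training samples, B the batch size, and D the number of parameter blocks.

    # Hypothetical helper implementing min(max(n/(BD), 50), 100).
    # Assumed symbols: n = training samples, B = batch size, D = number of blocks.
    def suggested_switch_block_every(n: int, B: int, D: int) -> int:
        return int(min(max(n / (B * D), 50), 100))

    # Example: 60,000 samples, batch size 8, 32 layer blocks:
    # n / (B * D) = 60000 / 256 ≈ 234.4, clamped down to 100.
    print(suggested_switch_block_every(60_000, 8, 32))  # 100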

Maintenance & Community

  • Accepted to NeurIPS 2024.
  • Integrated into LLaMA-Factory.
  • Supports model parallelism with DeepSpeed ZeRO-3.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

  • The BlockOptimizerRatio variant is under active development and currently supports only Adam updates; its gradient sparsification may introduce overhead.
  • Model parallelism with DeepSpeed ZeRO-3 can introduce significant communication overhead (e.g., ~3x observed for Llama 3-8B on 4 GPUs without NVLink).
  • Careful configuration of block partitioning is needed for tasks with randomly initialized layers.
Health Check

Last commit: 5 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.1%
2k
System for scalable LoRA adapter serving
created 1 year ago
updated 1 year ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
created 2 years ago
updated 3 months ago