BAdam by Ledzy

Memory-efficient optimizer for large language model finetuning

created 1 year ago
265 stars

Top 96.5% on SourcePulse

View on GitHub
Project Summary

BAdam offers a memory-efficient alternative to full-parameter fine-tuning of large language models by applying Adam's update rule to small parameter blocks sequentially. This approach significantly reduces memory requirements, enabling fine-tuning of models like Llama 3-8B on a single RTX3090, while achieving competitive or superior performance compared to LoRA.

How It Works

BAdam implements block coordinate optimization by iterating through partitions of the model's parameters (e.g., individual transformer layers). For a specified number of updates (switch_block_every), it applies the Adam optimizer only to the currently active block, keeping all other parameters frozen. This sequential block processing drastically lowers peak memory usage for optimizer states and gradients. The library provides flexibility in defining these blocks, from entire layers to specific matrix modules, and supports model parallelism via DeepSpeed ZeRO-3 for distributed training.
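
The following is a minimal, self-contained sketch of the block-coordinate idea in plain PyTorch. It is an illustration only, not the library's API: the toy model, loss, and data are placeholders, and only the hyperparameter name switch_block_every is taken from the description above.

    import torch
    from torch import nn

    # Illustration of sequential block-wise Adam: partition parameters into
    # blocks (here, one block per layer) and let only the active block train.
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])  # stand-in for transformer layers
    blocks = [list(layer.parameters()) for layer in model]
    switch_block_every = 50  # Adam steps spent on each block before switching

    for block in blocks:
        # Freeze everything, then unfreeze the active block. Adam states and
        # gradients exist only for this block, which is where the memory saving comes from.
        for p in model.parameters():
            p.requires_grad_(False)
        for p in block:
            p.requires_grad_(True)

        optimizer = torch.optim.Adam(block, lr=1e-5)
        for _ in range(switch_block_every):
            batch = torch.randn(4, 256)        # placeholder data
            loss = model(batch).pow(2).mean()  # placeholder objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()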

Quick Start & Requirements

  • Install via pip: pip install badam
  • Build from source: git clone https://github.com/Ledzy/BAdam.git && cd BAdam && pip install -e .
  • For reproducing paper results: conda create -n badam python=3.10 && conda activate badam && pip install -r requirements.txt
  • Requires PyTorch and mixed-precision training (e.g., bf16=True); tested on NVIDIA GPUs (RTX 3090).
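
The snippet below sketches how BAdam is typically dropped into an existing training setup by wrapping the base optimizer. The import path and all argument names other than switch_block_every (base_optimizer, named_parameters_list, switch_mode) are assumptions to verify against the README, and the two-layer model is a placeholder for a real transformer.

    import torch
    from torch import nn
    from badam import BlockOptimizer  # assumed import path; confirm against the README

    model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))  # placeholder for a real LLM

    # Wrap a standard optimizer so that only one parameter block is updated at a time.
    # Argument names are assumed from the project docs and may differ across versions.
    base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    optimizer = BlockOptimizer(
        base_optimizer=base_optimizer,
        named_parameters_list=list(model.named_parameters()),
        switch_block_every=100,  # Adam steps before the next block becomes active
        switch_mode="random",    # order in which blocks are visited (assumed option)
    )
    # optimizer.step() and optimizer.zero_grad() are then called as usual
    # inside the existing training loop.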

Highlighted Details

  • Fine-tunes Llama 2-7B in 21.8 GB of memory and Llama 3-8B in 23.5 GB, compared to 122.8 GB+ and 144 GB+ for full-parameter Adam.
  • Outperforms LoRA on MT-bench scores (e.g., 6.67 for BAdam vs. 6.41 for LoRA on Llama 3-8B).
  • Supports custom block partitioning (e.g., by module, by parameter ratio) and DeepSpeed ZeRO-3 for model parallelism.
  • Adaptive switch_block_every hyperparameter suggestion: min(max(n/(BD), 50), 100).
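
As a worked example of the clamp in the last bullet, the hypothetical helper below computes the suggested value, under the assumed reading that n is the number of training samples, B the batch size, and D the number of parameter blocks.

    # Hypothetical helper implementing min(max(n/(BD), 50), 100).
    # Assumed symbols: n = training samples, B = batch size, D = number of blocks.
    def suggested_switch_block_every(n: int, B: int, D: int) -> int:
        return int(min(max(n / (B * D), 50), 100))

    # Example: 60,000 samples, batch size 8, 32 layer blocks:
    # n / (B * D) = 60000 / 256 ≈ 234.4, clamped down to 100.
    print(suggested_switch_block_every(60_000, 8, 32))  # 100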

Maintenance & Community

  • Accepted to NeurIPS 2024.
  • Integrated into LLaMA-Factory.
  • Supports model parallelism with DeepSpeed ZeRO-3.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

  • The BlockOptimizerRatio variant is under active development and currently supports only Adam updates; its gradient sparsification may introduce overhead.
  • Model parallelism with DeepSpeed ZeRO-3 can introduce significant communication overhead (e.g., ~3x observed for Llama 3-8B on 4 GPUs without NVLink).
  • Careful configuration of block partitioning is needed for tasks with randomly initialized layers.
Health Check

Last commit: 5 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.1%
2k
System for scalable LoRA adapter serving
created 1 year ago
updated 1 year ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k
Inference optimization for LLMs on low-resource hardware
created 2 years ago
updated 3 months ago