BAdam by Ledzy

Memory-efficient optimizer for large language model finetuning

Created 1 year ago
272 stars

Top 94.8% on SourcePulse

Project Summary

BAdam offers a memory-efficient alternative to full-parameter fine-tuning of large language models by applying Adam's update rule to small parameter blocks sequentially. This approach significantly reduces memory requirements, enabling fine-tuning of models like Llama 3-8B on a single RTX3090, while achieving competitive or superior performance compared to LoRA.

How It Works

BAdam implements block coordinate optimization by iterating through partitions of the model's parameters (e.g., individual transformer layers). For a specified number of updates (switch_block_every), it applies the Adam optimizer only to the current active block, keeping other parameters frozen. This sequential block processing drastically lowers the peak memory usage for optimizer states and gradients. The library provides flexibility in defining these blocks, from entire layers to specific matrix modules, and supports model parallelism via DeepSpeed ZeRO-3 for distributed training.
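The control flow can be sketched in plain PyTorch (a minimal illustration of the block-coordinate idea, not BAdam's actual implementation): only the active block carries gradients and Adam state, and the active block rotates every switch_block_every steps.

    import torch

    def block_coordinate_adam(model, data_loader, loss_fn,
                              switch_block_every=100, lr=1e-5):
        # One block per top-level child module (e.g., per transformer layer);
        # BAdam also supports finer-grained partitions.
        blocks = [list(child.parameters()) for child in model.children()]
        data_iter = iter(data_loader)
        for block in blocks:
            # Freeze everything, then unfreeze only the active block.
            for p in model.parameters():
                p.requires_grad_(False)
            for p in block:
                p.requires_grad_(True)
            # Adam state is allocated only for the active block, which is
            # what keeps the optimizer-state memory small.
            optimizer = torch.optim.Adam(block, lr=lr)
            for _ in range(switch_block_every):
                try:
                    inputs, targets = next(data_iter)
                except StopIteration:
                    return  # single pass over the data for this sketch
                optimizer.zero_grad()
                loss_fn(model(inputs), targets).backward()
                optimizer.step()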

Quick Start & Requirements

  • Install via pip: pip install badam (see the usage sketch after this list)
  • Build from source: git clone https://github.com/Ledzy/BAdam.git && cd BAdam && pip install -e .
  • For reproducing paper results: conda create -n badam python=3.10 && conda activate badam && pip install -r requirements.txt
  • Requires PyTorch and mixed-precision training (e.g., bf16=True); reported results were obtained on NVIDIA GPUs (tested with an RTX3090).
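
After installation, the intended workflow is to wrap an existing torch optimizer with BlockOptimizer. The sketch below assumes the BlockOptimizer constructor and its base_optimizer, named_parameters_list, switch_block_every, and switch_mode arguments as described in the repository README; verify the exact names against the installed version.

    import torch
    import torch.nn as nn
    from badam import BlockOptimizer  # package-level export assumed per the README

    # Placeholder model; in practice this is a Hugging Face causal LM.
    model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))

    # Any Adam-style torch optimizer can serve as the base optimizer.
    base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Wrap it so that only one parameter block is trained (and holds
    # optimizer state) at a time.
    optimizer = BlockOptimizer(
        base_optimizer=base_optimizer,
        named_parameters_list=list(model.named_parameters()),
        switch_block_every=100,   # updates per block before switching
        switch_mode="random",     # order in which blocks become active
    )

    # The wrapped optimizer is then used like a normal torch optimizer:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()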

Highlighted Details

  • Achieves 21.8 GB memory for Llama 2-7B and 23.5 GB for Llama 3-8B, compared to 122.8 GB+ and 144 GB+ for full Adam.
  • Outperforms LoRA on MT-bench scores (e.g., 6.67 for BAdam vs. 6.41 for LoRA on Llama 3-8B).
  • Supports custom block partitioning (e.g., by module, by parameter ratio) and DeepSpeed ZeRO-3 for model parallelism.
  • Adaptive switch_block_every hyperparameter suggestion: min(max(n/(BD), 50), 100), where n is the number of training samples, B the batch size, and D the number of blocks (see the worked example after this list).
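
A worked example of this heuristic (all values below are assumed, purely for illustration):

    # Illustrative values: 40,000 samples, batch size 16, 32 blocks.
    n, B, D = 40_000, 16, 32
    switch_block_every = min(max(n / (B * D), 50), 100)
    print(switch_block_every)  # 40000 / 512 = 78.125, already within [50, 100]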

Maintenance & Community

  • Accepted to NeurIPS 2024.
  • Integrated into LLaMA-Factory.
  • Supports model parallelism with DeepSpeed ZeRO-3.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

  • The BlockOptimizerRatio variant is under active development and currently supports only Adam updates; its gradient sparsification can introduce overhead.
  • Model parallelism with DeepSpeed ZeRO-3 can introduce significant communication overhead (e.g., ~3x observed for Llama 3-8B on 4 GPUs without NVLink).
  • Careful configuration of block partitioning is needed for tasks with randomly initialized layers.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Elvis Saravia (Founder of DAIR.AI), and 2 more.

YaFSDP by yandex
0.1% · 979 stars
Sharded data parallelism framework for transformer-like neural networks
Created 1 year ago · Updated 3 weeks ago
Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99
0.2% · 458 stars
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago · Updated 5 months ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai
0% · 790 stars
Toolkit for easy model parallelization
Created 4 years ago · Updated 2 years ago