Training script for LLMs using QLoRA + FSDP
This repository provides a script for training large language models (LLMs) using QLoRA (Quantized Low-Rank Adaptation) combined with Fully Sharded Data Parallelism (FSDP). It targets researchers and practitioners looking to fine-tune LLMs efficiently on limited hardware, offering significant memory savings and faster training times compared to full fine-tuning.
How It Works
The core innovation lies in the integration of QLoRA's 4-bit quantization with PyTorch's FSDP for distributed training. This approach quantizes model weights to 4-bit precision, drastically reducing memory requirements. FSDP then shards the quantized model, optimizer states, and gradients across multiple GPUs. Custom low-memory loading code is employed to load and quantize model layers iteratively, avoiding the need to load the entire model into GPU memory at once. The script supports both bitsandbytes and HQQ quantization backends, with options for gradient checkpointing and CPU offloading.
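The low-memory loading idea can be pictured with a small sketch. The snippet below is not the repository's actual loader; it is a minimal illustration, assuming bitsandbytes >= 0.43.0 and safetensors are installed, of quantizing one checkpoint shard at a time so the full-precision model never has to sit in GPU memory all at once.

```python
# Minimal sketch (not the repository's loader): load one safetensors shard at a
# time and quantize each 2-D weight to 4-bit NF4 on the GPU, so the full-precision
# model never needs to fit in GPU memory in one piece.
import torch
import bitsandbytes as bnb
from safetensors.torch import load_file

def quantize_shard(shard_path: str, device: str = "cuda"):
    state = load_file(shard_path)              # CPU tensors for this shard only
    quantized = {}
    for name, tensor in state.items():
        # A real loader would also skip embeddings, lm_head, etc.;
        # this sketch only filters out 1-D tensors such as norm weights.
        if tensor.ndim == 2 and name.endswith("weight"):
            # quantize_4bit returns the packed 4-bit data plus its quantization state
            packed, quant_state = bnb.functional.quantize_4bit(
                tensor.to(device, dtype=torch.bfloat16), quant_type="nf4"
            )
            quantized[name] = (packed, quant_state)
        else:
            quantized[name] = tensor.to(device, dtype=torch.bfloat16)
        del tensor                             # release the CPU copy promptly
    return quantized
```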
Quick Start & Requirements
- Install the core dependencies: pip install llama-recipes fastcore "transformers!=4.38.*,!=4.39.*" --extra-index-url https://download.pytorch.org/whl/test/cu118 (adjust the CUDA version as needed), plus pip install bitsandbytes>=0.43.0.
- Log in to the Hugging Face Hub (huggingface-cli login).
- Optional: wandb for logging, HQQ custom kernels (hqq/kernels/setup_cuda.py).
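For orientation, a representative launch of the training script looks roughly like the command below. The entry-point name and flags are assumptions based on typical fsdp_qlora-style invocations and may differ in this version; check the script's help output for the exact options.

```bash
# Hedged example only: flag names may not match this version exactly.
# Run `python train.py --help` to confirm the available options.
python train.py \
  --model_name meta-llama/Llama-2-70b-hf \
  --train_type qlora \
  --precision bf16 \
  --batch_size 2 \
  --context_length 512 \
  --use_gradient_checkpointing true \
  --use_cpu_offload true \
  --dataset alpaca
```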
Highlighted Details
Custom loading code works around AutoModel.from_pretrained limitations when loading pre-quantized weights.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This is an alpha/preview release, requiring users comfortable with testing and debugging. Custom model loading code is necessary due to incompatibilities with Hugging Face Transformers for quantized weights. Careful configuration of FSDP's Mixed Precision is required to avoid corrupting quantized weights.
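To illustrate the mixed-precision caveat, here is a hedged sketch of an FSDP MixedPrecision policy; the exact settings this script uses may differ. The point is that param_dtype must match the storage dtype of the quantized weights, otherwise FSDP's dtype casting will corrupt the packed 4-bit data.

```python
# Hedged sketch, not necessarily the exact policy this script uses.
import torch
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # must match the quantized weights' storage dtype
    reduce_dtype=torch.bfloat16,  # dtype used for gradient reduction
    buffer_dtype=torch.bfloat16,  # dtype for module buffers
)
# Passed to FSDP when wrapping the model, e.g.:
# model = FSDP(model, mixed_precision=mp_policy, ...)
```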