Training script for LLMs using QLoRA + FSDP
This repository provides a script for training large language models (LLMs) using QLoRA (Quantized Low-Rank Adaptation) combined with Fully Sharded Data Parallelism (FSDP). It targets researchers and practitioners looking to fine-tune LLMs efficiently on limited hardware, offering significant memory savings and faster training times compared to full fine-tuning.
How It Works
The core innovation lies in the integration of QLoRA's 4-bit quantization with PyTorch's FSDP for distributed training. This approach quantizes model weights to 4-bit precision, drastically reducing memory requirements. FSDP then shards the quantized model, optimizer states, and gradients across multiple GPUs. Custom low-memory loading code is employed to load and quantize model layers iteratively, avoiding the need to load the entire model into GPU memory at once. The script supports both bitsandbytes and HQQ quantization backends, with options for gradient checkpointing and CPU offloading.
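The low-memory loading idea can be pictured with a small sketch. The snippet below is not the repository's actual loader; it is a minimal illustration, assuming bitsandbytes >= 0.43.0 and safetensors are installed, of quantizing one checkpoint shard at a time so the full-precision model never has to sit in GPU memory all at once.

```python
# Minimal sketch (not the repository's loader): load one safetensors shard at a
# time and quantize each 2-D weight to 4-bit NF4 on the GPU, so the full-precision
# model never needs to fit in GPU memory in one piece.
import torch
import bitsandbytes as bnb
from safetensors.torch import load_file

def quantize_shard(shard_path: str, device: str = "cuda"):
    state = load_file(shard_path)              # CPU tensors for this shard only
    quantized = {}
    for name, tensor in state.items():
        # A real loader would also skip embeddings, lm_head, etc.;
        # this sketch only filters out 1-D tensors such as norm weights.
        if tensor.ndim == 2 and name.endswith("weight"):
            # quantize_4bit returns the packed 4-bit data plus its quantization state
            packed, quant_state = bnb.functional.quantize_4bit(
                tensor.to(device, dtype=torch.bfloat16), quant_type="nf4"
            )
            quantized[name] = (packed, quant_state)
        else:
            quantized[name] = tensor.to(device, dtype=torch.bfloat16)
        del tensor                             # release the CPU copy promptly
    return quantized
```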
Quick Start & Requirements
- Install the core dependencies: pip install llama-recipes fastcore "transformers!=4.38.*,!=4.39.*" --extra-index-url https://download.pytorch.org/whl/test/cu118 (adjust the CUDA version as needed), plus pip install bitsandbytes>=0.43.0.
- Log in to the Hugging Face Hub (huggingface-cli login).
- Optional: wandb for logging, HQQ custom kernels (hqq/kernels/setup_cuda.py).
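For orientation, a representative launch of the training script looks roughly like the command below. The entry-point name and flags are assumptions based on typical fsdp_qlora-style invocations and may differ in this version; check the script's help output for the exact options.

```bash
# Hedged example only: flag names may not match this version exactly.
# Run `python train.py --help` to confirm the available options.
python train.py \
  --model_name meta-llama/Llama-2-70b-hf \
  --train_type qlora \
  --precision bf16 \
  --batch_size 2 \
  --context_length 512 \
  --use_gradient_checkpointing true \
  --use_cpu_offload true \
  --dataset alpaca
```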
Highlighted Details
Custom loading code works around AutoModel.from_pretrained limitations when loading pre-quantized weights.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This is an alpha/preview release, requiring users comfortable with testing and debugging. Custom model loading code is necessary due to incompatibilities with Hugging Face Transformers for quantized weights. Careful configuration of FSDP's Mixed Precision is required to avoid corrupting quantized weights.
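To illustrate the mixed-precision caveat, here is a hedged sketch of an FSDP MixedPrecision policy; the exact settings this script uses may differ. The point is that param_dtype must match the storage dtype of the quantized weights, otherwise FSDP's dtype casting will corrupt the packed 4-bit data.

```python
# Hedged sketch, not necessarily the exact policy this script uses.
import torch
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # must match the quantized weights' storage dtype
    reduce_dtype=torch.bfloat16,  # dtype used for gradient reduction
    buffer_dtype=torch.bfloat16,  # dtype for module buffers
)
# Passed to FSDP when wrapping the model, e.g.:
# model = FSDP(model, mixed_precision=mp_policy, ...)
```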