Finetuning tool for quantized LLMs
Top 4.8% on sourcepulse
QLoRA provides an efficient method for finetuning large language models (LLMs) by reducing memory requirements, enabling training on consumer hardware. It targets researchers and practitioners seeking to adapt LLMs for specific tasks without prohibitive computational costs. The primary benefit is democratizing LLM finetuning, making powerful models accessible for customization.
How It Works
QLoRA backpropagates gradients through a frozen, 4-bit quantized LLM into Low-Rank Adapters (LoRA). The approach relies on innovations such as the 4-bit NormalFloat (NF4) data type, Double Quantization of the quantization constants, and Paged Optimizers to manage memory spikes. Together these significantly reduce memory usage, allowing a 65B parameter model to be finetuned on a single 48GB GPU while matching 16-bit finetuning performance.
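As a concrete illustration, the following is a minimal sketch of this recipe using Hugging Face transformers, peft, and bitsandbytes; the model name, target modules, and LoRA hyperparameters are illustrative assumptions, not values prescribed by the repository.

# Minimal QLoRA-style setup sketch; model name and LoRA settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,        # Double Quantization of quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute dtype used for the frozen base weights
)

# Load the base model frozen in 4-bit precision.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                 # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach Low-Rank Adapters; gradients flow only into these adapter weights.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)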
Quick Start & Requirements
pip install -U -r requirements.txt
Dependencies include accelerate, transformers (installed from source), and bitsandbytes.
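Once the dependencies are installed, a finetuning run can be driven with the standard transformers Trainer. The sketch below assumes the 4-bit LoRA model built in the earlier example and a user-provided train_dataset; the output path and hyperparameters are placeholders. It also shows how a Paged Optimizer is selected via the optim argument.

# Minimal training sketch; dataset, output path, and hyperparameters are assumptions.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",  # Paged Optimizer to absorb memory spikes
    max_steps=1000,
)

# `model` is the 4-bit base with LoRA adapters from the sketch above;
# `train_dataset` is a pre-tokenized dataset supplied by the user.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()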
Highlighted Details
Maintenance & Community
Developed by the UW NLP group. Integrates with Hugging Face's PEFT and transformers libraries. Code and models are released.
Licensing & Compatibility
MIT license for repository resources. Guanaco models are released under terms aligned with the LLaMA license, requiring access to base LLaMA models.
Limitations & Caveats
4-bit inference is currently slow because 4-bit matrix multiplication is not yet integrated. Resuming LoRA training with Hugging Face's Trainer is not supported. Using bnb_4bit_compute_type='fp16' can lead to instabilities. For 7B LLaMA models, only ~80% of finetuning runs complete without error. Ensure tokenizer.bos_token_id = 1 to avoid generation issues.
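A short sketch of how the last two workarounds might look in code; the model name is an illustrative assumption.

# Illustrative caveat workarounds; the model name is an assumption.
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.bos_token_id = 1  # set the LLaMA BOS token id to avoid generation issues

# Prefer bfloat16 over fp16 as the 4-bit compute dtype to reduce instabilities.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)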