qlora by artidoro

Finetuning tool for quantized LLMs

created 2 years ago
10,586 stars

Top 4.8% on sourcepulse

Project Summary

QLoRA provides an efficient method for finetuning large language models (LLMs) by reducing memory requirements, enabling training on consumer hardware. It targets researchers and practitioners seeking to adapt LLMs for specific tasks without prohibitive computational costs. The primary benefit is democratizing LLM finetuning, making powerful models accessible for customization.

How It Works

QLoRA backpropagates gradients through a frozen, 4-bit quantized LLM into Low-Rank Adapters (LoRA). The approach combines the 4-bit NormalFloat (NF4) data type, Double Quantization (quantizing the quantization constants themselves), and Paged Optimizers that absorb memory spikes. Together these cut memory usage enough to finetune a 65B-parameter model on a single 48GB GPU while matching 16-bit finetuning performance.
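A minimal sketch of how these pieces are typically wired together with Hugging Face's transformers, peft, and bitsandbytes; the model name, LoRA hyperparameters, and training settings below are illustrative assumptions rather than the repository's exact configuration:

```python
# Sketch: 4-bit NF4 base model + Double Quantization + LoRA adapters + paged optimizer.
# Model name and hyperparameters are illustrative, not the repo's exact defaults.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls in forward/backward
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # assumed base model, for illustration only
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,  # illustrative LoRA hyperparameters
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)   # only the adapters receive gradients
model.print_trainable_parameters()

# Paged optimizer: optimizer states are paged out on memory spikes.
training_args = TrainingArguments(
    output_dir="./qlora-out",
    optim="paged_adamw_32bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
)
```

The key point is that the quantized base model stays frozen; gradients flow through it only to update the small LoRA matrices, which is what keeps optimizer state and gradient memory low.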

Quick Start & Requirements

  • Install via pip install -U -r requirements.txt.
  • Requires PyTorch, accelerate, transformers (from source), and bitsandbytes.
  • For models larger than 13B, adjusting the learning rate is recommended.
  • Official demo and Colab notebooks are available for inference and finetuning (a minimal inference sketch follows this list).
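For inference, a hedged sketch of loading a 4-bit base model and attaching a Guanaco LoRA adapter with peft; the checkpoint names are assumptions for illustration, and the official demo notebooks show the exact models used:

```python
# Sketch of 4-bit inference with a Guanaco adapter; checkpoint names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base = "huggyllama/llama-7b"            # assumed base LLaMA checkpoint
adapters = "timdettmers/guanaco-7b"     # assumed Guanaco LoRA adapter checkpoint

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapters)  # LoRA weights on top of the frozen base

prompt = "### Human: What is QLoRA?### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```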

Highlighted Details

  • Achieves 99.3% of ChatGPT performance on the Vicuna benchmark with Guanaco models.
  • Supports 4-bit quantization with NF4, Double Quantization, and Paged Optimizers.
  • Enables finetuning of models up to 65B parameters on a single 48GB GPU (see the rough memory estimate after this list).
  • Includes scripts to reproduce Guanaco model training and evaluation using GPT-4.
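As a rough sanity check of the 48GB figure, a back-of-envelope estimate using the paper's reported costs (about 4 bits per weight, plus roughly 0.127 bits per parameter of quantization constants after Double Quantization); the numbers deliberately ignore activations, adapter weights, and optimizer state:

```python
# Back-of-envelope memory estimate for a 65B-parameter base model stored in 4-bit.
# Approximate figures only: activations, LoRA weights, and optimizer state are not
# counted here, which is exactly why Paged Optimizers are still needed.
params = 65e9
weights_gb = params * 4 / 8 / 1e9        # 4-bit frozen weights            ~32.5 GB
constants_gb = params * 0.127 / 8 / 1e9  # quantization constants with
                                         # Double Quantization              ~1.0 GB
print(f"frozen base model: ~{weights_gb + constants_gb:.1f} GB of a 48 GB GPU")
```

The remaining headroom holds the LoRA adapters, their optimizer states, and activations, with the paged optimizer absorbing occasional spikes.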

Maintenance & Community

Developed by the UW NLP group. Integrates with Hugging Face's PEFT and transformers libraries; both the code and the Guanaco models are publicly released.

Licensing & Compatibility

MIT license for repository resources. Guanaco models are released under terms aligned with the LLaMA license, requiring access to base LLaMA models.

Limitations & Caveats

  • 4-bit inference is currently slow due to the lack of integration with 4-bit matrix multiplication.
  • Resuming a LoRA training run with Hugging Face's Trainer is not supported.
  • Using bnb_4bit_compute_type='fp16' can lead to instabilities; for 7B LLaMA models, only ~80% of finetuning runs complete without error.
  • Ensure tokenizer.bos_token_id = 1 to avoid generation issues.
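Two of these caveats map directly onto configuration: in Hugging Face's BitsAndBytesConfig the compute dtype is set via bnb_4bit_compute_dtype, where bfloat16 is the safer choice over fp16, and the BOS token id can be set explicitly on the tokenizer. A hedged sketch, with the base model name as an assumption for illustration:

```python
# Sketch of two workarounds for the caveats above (check the repo's README/issues
# for current guidance).
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig

# 1) Prefer bfloat16 over fp16 as the 4-bit compute dtype to avoid instabilities.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # instead of torch.float16
)

# 2) Set the BOS token id to 1 so generation is not degraded.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed base model
tokenizer.bos_token_id = 1
```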

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

208 stars in the last 90 days
