Finetuning tool for quantized LLMs
Top 4.8% on sourcepulse
QLoRA provides an efficient method for finetuning large language models (LLMs) by reducing memory requirements, enabling training on consumer hardware. It targets researchers and practitioners seeking to adapt LLMs for specific tasks without prohibitive computational costs. The primary benefit is democratizing LLM finetuning, making powerful models accessible for customization.
How It Works
QLoRA backpropagates gradients through a frozen, 4-bit quantized LLM into Low-Rank Adapters (LoRA). The approach relies on innovations such as the 4-bit NormalFloat (NF4) data type, Double Quantization of the quantization constants, and Paged Optimizers to manage memory spikes. Together these significantly reduce memory usage, allowing a 65B parameter model to be finetuned on a single 48GB GPU while matching 16-bit finetuning performance.
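As a concrete illustration, the following is a minimal sketch of this recipe using Hugging Face transformers, peft, and bitsandbytes; the model name, target modules, and LoRA hyperparameters are illustrative assumptions, not values prescribed by the repository.

# Minimal QLoRA-style setup sketch; model name and LoRA settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,        # Double Quantization of quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute dtype used for the frozen base weights
)

# Load the base model frozen in 4-bit precision.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                 # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach Low-Rank Adapters; gradients flow only into these adapter weights.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)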
Quick Start & Requirements
pip install -U -r requirements.txt
Dependencies include accelerate, transformers (installed from source), and bitsandbytes.
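Once the dependencies are installed, a finetuning run can be driven with the standard transformers Trainer. The sketch below assumes the 4-bit LoRA model built in the earlier example and a user-provided train_dataset; the output path and hyperparameters are placeholders. It also shows how a Paged Optimizer is selected via the optim argument.

# Minimal training sketch; dataset, output path, and hyperparameters are assumptions.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",  # Paged Optimizer to absorb memory spikes
    max_steps=1000,
)

# `model` is the 4-bit base with LoRA adapters from the sketch above;
# `train_dataset` is a pre-tokenized dataset supplied by the user.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()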
Highlighted Details
Maintenance & Community
Developed by the UW NLP group. Integrates with Hugging Face's PEFT and transformers libraries. Code and models are released.
Licensing & Compatibility
MIT license for repository resources. Guanaco models are released under terms aligned with the LLaMA license, requiring access to base LLaMA models.
Limitations & Caveats
4-bit inference is currently slow because 4-bit matrix multiplication is not yet integrated. Resuming LoRA training with Hugging Face's Trainer is not supported. Using bnb_4bit_compute_type='fp16' can lead to instabilities. For 7B LLaMA models, only ~80% of finetuning runs complete without error. Ensure tokenizer.bos_token_id = 1 to avoid generation issues.
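A short sketch of how the last two workarounds might look in code; the model name is an illustrative assumption.

# Illustrative caveat workarounds; the model name is an assumption.
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.bos_token_id = 1  # set the LLaMA BOS token id to avoid generation issues

# Prefer bfloat16 over fp16 as the 4-bit compute dtype to reduce instabilities.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)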