fsdp_qlora by AnswerDotAI

Training script for LLMs using QLoRA + FSDP

created 1 year ago
1,525 stars

Top 27.7% on sourcepulse

Project Summary

This repository provides a script for training large language models (LLMs) using QLoRA (Quantized Low-Rank Adaptation) combined with Fully Sharded Data Parallelism (FSDP). It targets researchers and practitioners looking to fine-tune LLMs efficiently on limited hardware, offering significant memory savings and faster training times compared to full fine-tuning.

How It Works

The core innovation lies in the integration of QLoRA's 4-bit quantization with PyTorch's FSDP for distributed training. This approach quantizes model weights to 4-bit precision, drastically reducing memory requirements. FSDP then shards the quantized model, optimizer states, and gradients across multiple GPUs. Custom low-memory loading code is employed to load and quantize model layers iteratively, avoiding the need to load the entire model into GPU memory at once. The script supports both bitsandbytes and HQQ quantization backends, with options for gradient checkpointing and CPU offloading.
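
To make the loading scheme concrete, here is a minimal sketch of the iterative load-and-quantize idea, assuming the model's linear layers have already been swapped for bitsandbytes Linear4bit modules. The function name and exact flow are illustrative assumptions, not the repository's actual code.

    # Minimal sketch of low-memory loading: quantize one checkpoint shard at a
    # time so the full-precision model never has to sit in GPU memory at once.
    # Illustrative only: names and flow are assumptions, not the repo's code.
    import safetensors.torch
    import bitsandbytes as bnb

    def load_and_quantize_shard(model, shard_path, device="cuda"):
        # Load one safetensors shard on CPU.
        state_dict = safetensors.torch.load_file(shard_path, device="cpu")
        for full_name, tensor in state_dict.items():
            parent_name, param_name = full_name.rsplit(".", 1)
            module = model.get_submodule(parent_name)
            if isinstance(module, bnb.nn.Linear4bit) and param_name == "weight":
                # Wrap the fp16/bf16 weight in Params4bit; moving it to the GPU
                # triggers the actual 4-bit (NF4) quantization.
                module.weight = bnb.nn.Params4bit(
                    tensor, requires_grad=False, quant_type="nf4"
                ).to(device)
            else:
                # Norms, embeddings, biases, etc. are loaded unquantized.
                module.load_state_dict({param_name: tensor}, strict=False)

After all shards are processed this way, the quantized model can be wrapped with FSDP for sharded training.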

Quick Start & Requirements

  • Installation: Clone the repository, then run pip install llama-recipes fastcore "transformers!=4.38.*,!=4.39.*" --extra-index-url https://download.pytorch.org/whl/test/cu118 (adjust the CUDA version as needed) and pip install "bitsandbytes>=0.43.0" (the quotes keep the shell from treating >= as a redirect). Log in to the Hugging Face Hub with huggingface-cli login.
  • Prerequisites: CUDA (tested with 11.7, 11.8, and 12.1); PyTorch >= 2.2 is recommended for Flash Attention 2. Optional: wandb for logging and the HQQ custom kernels (hqq/kernels/setup_cuda.py). A quick environment sanity check follows this list.
  • Resources: The example Llama-2 70B finetuning command requires ~128GB of CPU RAM; a swap file is recommended.
  • Docs: Announcement Blog Post (related to 4-bit LLMs).
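
As a quick sanity check before launching a run, the snippet below prints the versions called out above; it is illustrative and not part of the repository.

    # Quick environment sanity check for the prerequisites listed above
    # (illustrative; not part of the repository).
    import torch
    import bitsandbytes as bnb

    print("torch:", torch.__version__)          # >= 2.2 recommended for Flash Attention 2
    print("cuda:", torch.version.cuda)          # tested with 11.7, 11.8, 12.1
    print("gpus:", torch.cuda.device_count())   # FSDP shards the model across multiple GPUs
    print("bitsandbytes:", bnb.__version__)     # >= 0.43.0 required for the FSDP support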

Highlighted Details

  • Supports multiple fine-tuning methods: full parameter, LoRA, QLoRA (bitsandbytes or HQQ), DoRA, and Llama-Pro variants (the basic QLoRA pattern is illustrated in the sketch after this list).
  • Offers flexible mixed-precision training options (fp32, bf16, fp16 autocast) with detailed explanations for each.
  • Includes custom loading code for quantized models to bypass Hugging Face AutoModel.from_pretrained limitations.
  • Provides example scripts for training Llama 70B on 4x A100 40GB GPUs using both BnB QLoRA and HQQ QLoRA.
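
The sketch below shows the general QLoRA building block referenced in the first bullet: a small trainable low-rank adapter added on top of a frozen 4-bit base linear layer. It is a simplified stand-in, not the repository's own LoRA/DoRA implementation.

    # Simplified QLoRA building block: trainable low-rank adapter over a frozen
    # 4-bit base layer (illustrative; the repo ships its own LoRA/DoRA modules).
    import torch.nn as nn
    import bitsandbytes as bnb

    class LoRALinear4bit(nn.Module):
        def __init__(self, base: bnb.nn.Linear4bit, rank: int = 16, alpha: int = 16):
            super().__init__()
            self.base = base                    # 4-bit quantized, kept frozen
            for p in self.base.parameters():
                p.requires_grad = False
            self.lora_A = nn.Linear(base.in_features, rank, bias=False)
            self.lora_B = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            # Frozen quantized path plus the trainable low-rank update.
            return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

Only the adapter parameters receive gradients, which is what keeps optimizer state and gradient memory small even for 70B-parameter base models.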

Maintenance & Community

  • The project is described as an "alpha/preview release," suggesting ongoing development and potential instability.
  • An experimental integration with Axolotl is noted.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. However, the dependencies (Hugging Face Transformers, bitsandbytes, PyTorch) generally permit commercial use, but specific restrictions may apply to the base models used (e.g., Llama 2).

Limitations & Caveats

This is an alpha/preview release, so users should be comfortable with testing and debugging. Custom model loading code is necessary because of incompatibilities between Hugging Face Transformers and the quantized weights. FSDP's MixedPrecision policy must be configured carefully to avoid corrupting the quantized weights.
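
For context on the mixed-precision caveat, a bf16 FSDP policy typically looks like the hedged sketch below; the point is that a blanket param_dtype cast must not touch the packed 4-bit weight storage (the repository handles this with its own wrapping and precision logic).

    # Hedged sketch of an FSDP bf16 mixed-precision policy (illustrative only).
    import torch
    from torch.distributed.fsdp import MixedPrecision

    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,    # dtype of unsharded params in forward/backward
        reduce_dtype=torch.bfloat16,   # dtype used for gradient reduction
        buffer_dtype=torch.bfloat16,
    )
    # Caveat from the README: casting the packed 4-bit weight storage with a
    # blanket param_dtype would corrupt it, so quantized parameters must keep
    # their storage dtype when the model is wrapped with FSDP.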

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 50 stars in the last 90 days

Explore Similar Projects

  • llm-awq by mit-han-lab: weight quantization research paper for LLM compression/acceleration. 3k stars; created 2 years ago, updated 2 weeks ago. Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeremy Howard (cofounder of fast.ai), and 4 more.
  • qlora by artidoro: finetuning tool for quantized LLMs. 11k stars; created 2 years ago, updated 1 year ago. Starred by Tobi Lutke (cofounder of Shopify), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 10 more.