LoRA finetuning for large language models on limited-memory devices
This project enables fine-tuning of large language models like Llama2 and CodeLlama on consumer hardware, specifically Apple Silicon Macs and NVIDIA GPUs, without requiring quantization. It addresses the challenge of limited VRAM by offloading model parts to SSD or main memory, making large model fine-tuning accessible to a wider audience.
How It Works
The core of the project is an offloading strategy that segments the model weights and stores them on SSD or in RAM; during the forward and backward passes, only the components currently needed are loaded into memory. Backpropagation uses a two-pass approach: the first pass caches intermediate activations on disk, and the second pass computes gradients from these cached values. The current implementation uses LoRA to restrict updates to a small set of adapter parameters, reducing both the computational and the storage overhead.
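The sketch below is a minimal illustration of this idea, not the project's actual code; the class names, shard file layout, and rank are invented for the example, and the two-pass activation caching is omitted for brevity. Frozen base weights are streamed from per-layer shard files just before each layer runs, while only the small LoRA adapter matrices stay resident in memory and receive gradients.

```python
# Illustrative sketch only -- not the project's implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # frozen base weight
        self.lora_a = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        # Base projection plus the low-rank correction x @ A @ B.
        return self.base(x) + (x @ self.lora_a) @ self.lora_b

def streamed_forward(layers, shard_paths, x):
    """Run layers sequentially, loading each frozen shard from disk on demand."""
    for layer, path in zip(layers, shard_paths):
        shard = torch.load(path, map_location="cpu")     # e.g. {"weight": tensor}
        layer.base.load_state_dict(shard)                # swap in this layer's weights
        x = layer(x)                                     # LoRA params stay resident
    return x
```

Because only the adapter matrices require gradients, the optimizer state and the gradients stay small, which is what makes SSD/RAM offloading of the frozen weights tractable.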
Quick Start & Requirements
1. Install the dependencies: pip install torch sentencepiece numpy
2. Place tokenizer.model in the same directory as the model weights.
3. Run python prepare_model.py to convert the model to a sequential format (a conceptual sketch of this step follows the list).
4. Run python finetune.py.
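As a rough illustration of what a sequential layout could look like (a hypothetical sketch, not what prepare_model.py actually does), a monolithic checkpoint can be split into per-layer shard files so that each layer can later be loaded from SSD independently:

```python
# Hypothetical sketch of splitting a checkpoint into per-layer shards;
# the real prepare_model.py may use a different format entirely.
import os
import torch

def split_into_shards(checkpoint_path: str, out_dir: str) -> None:
    state = torch.load(checkpoint_path, map_location="cpu")
    os.makedirs(out_dir, exist_ok=True)
    shards: dict[str, dict[str, torch.Tensor]] = {}
    for name, tensor in state.items():
        # Group parameters by their top-level block, e.g. "layers.0", "layers.1".
        prefix = ".".join(name.split(".")[:2])
        shards.setdefault(prefix, {})[name] = tensor
    for prefix, tensors in shards.items():
        torch.save(tensors, os.path.join(out_dir, f"{prefix}.pt"))
```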
Highlighted Details
Maintenance & Community
The project appears to be a personal effort, and the README indicates limited public community engagement. Contact information is provided via a GitHub handle and an email address.
Licensing & Compatibility
The project's model definition is based on llama2.c, which is MIT licensed. Commercial use and closed-source linking are generally permitted thanks to the MIT license of the core components, but users should verify the licenses of the base models they fine-tune.
Limitations & Caveats
The project is experimental, particularly the CUDA path. Full fine-tuning was removed due to concerns about SSD write endurance. Performance on 70B models can be slow on older hardware, and prefetching and asynchronous saving are noted as potential optimizations.