LoRA finetuning for large language models on limited-memory devices
This project enables fine-tuning of large language models like Llama2 and CodeLlama on consumer hardware, specifically Apple Silicon Macs and NVIDIA GPUs, without requiring quantization. It addresses the challenge of limited VRAM by offloading model parts to SSD or main memory, making large model fine-tuning accessible to a wider audience.
How It Works
The core of the project is an offloading strategy that segments the model weights and stores them on SSD or in RAM; during the forward and backward passes, only the components currently needed are loaded into memory. Backpropagation uses a two-pass approach: the first pass caches intermediate activations on disk, and the second pass computes gradients from these cached values. The current implementation uses LoRA to restrict updates to a small set of adapter parameters, reducing both the computational and the storage overhead.
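The sketch below is a minimal illustration of this idea, not the project's actual code; the class names, shard file layout, and rank are invented for the example, and the two-pass activation caching is omitted for brevity. Frozen base weights are streamed from per-layer shard files just before each layer runs, while only the small LoRA adapter matrices stay resident in memory and receive gradients.

```python
# Illustrative sketch only -- not the project's implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # frozen base weight
        self.lora_a = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        # Base projection plus the low-rank correction x @ A @ B.
        return self.base(x) + (x @ self.lora_a) @ self.lora_b

def streamed_forward(layers, shard_paths, x):
    """Run layers sequentially, loading each frozen shard from disk on demand."""
    for layer, path in zip(layers, shard_paths):
        shard = torch.load(path, map_location="cpu")     # e.g. {"weight": tensor}
        layer.base.load_state_dict(shard)                # swap in this layer's weights
        x = layer(x)                                     # LoRA params stay resident
    return x
```

Because only the adapter matrices require gradients, the optimizer state and the gradients stay small, which is what makes SSD/RAM offloading of the frozen weights tractable.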
Quick Start & Requirements
1. Install the dependencies: pip install torch sentencepiece numpy
2. Place tokenizer.model in the same directory as the model weights.
3. Run python prepare_model.py to convert the model to a sequential format (a conceptual sketch of this step follows the list).
4. Run python finetune.py.
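As a rough illustration of what a sequential layout could look like (a hypothetical sketch, not what prepare_model.py actually does), a monolithic checkpoint can be split into per-layer shard files so that each layer can later be loaded from SSD independently:

```python
# Hypothetical sketch of splitting a checkpoint into per-layer shards;
# the real prepare_model.py may use a different format entirely.
import os
import torch

def split_into_shards(checkpoint_path: str, out_dir: str) -> None:
    state = torch.load(checkpoint_path, map_location="cpu")
    os.makedirs(out_dir, exist_ok=True)
    shards: dict[str, dict[str, torch.Tensor]] = {}
    for name, tensor in state.items():
        # Group parameters by their top-level block, e.g. "layers.0", "layers.1".
        prefix = ".".join(name.split(".")[:2])
        shards.setdefault(prefix, {})[name] = tensor
    for prefix, tensors in shards.items():
        torch.save(tensors, os.path.join(out_dir, f"{prefix}.pt"))
```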
Highlighted Details
Maintenance & Community
The project appears to be a personal effort, and the README indicates limited public community engagement. Contact information is provided via a GitHub handle and an email address.
Licensing & Compatibility
The project's model definition is based on llama2.c, which is MIT licensed. Commercial use and closed-source linking are generally permitted thanks to the MIT license of the core components, but users should verify the licenses of the base models they fine-tune.
Limitations & Caveats
The project is experimental, particularly the CUDA path. Full fine-tuning was removed due to concerns about SSD write endurance. Performance on 70B models can be slow on older hardware, and prefetching and asynchronous saving are noted as potential optimizations.