slowllama by okuvshynov

LoRA finetuning for large language models on limited-memory devices

created 1 year ago
448 stars

Top 68.1% on sourcepulse

View on GitHub
Project Summary

This project enables fine-tuning of large language models like Llama2 and CodeLlama on consumer hardware, specifically Apple Silicon Macs and NVIDIA GPUs, without requiring quantization. It addresses the challenge of limited VRAM by offloading model parts to SSD or main memory, making large model fine-tuning accessible to a wider audience.

How It Works

The core idea is an offloading strategy that splits the model's weights into blocks and stores them on SSD or in main memory, loading only the block currently being computed. Backpropagation uses a two-pass approach: the forward pass caches each block's input activations to disk, and the backward pass walks the blocks in reverse, reloading each block and its cached inputs, re-running that block's forward computation, and computing gradients one block at a time. The current implementation uses LoRA so that only a small set of adapter parameters is updated, keeping compute and storage overhead low.
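
A minimal sketch of this two-pass, block-offloading idea is shown below. It is illustrative only, not slowllama's actual code: the "blocks" are toy linear layers, and the file layout and loss are assumptions.

    # Illustrative only (not slowllama's actual code): toy single-linear "blocks",
    # hypothetical file layout, and a simple 0.5*||h||^2 loss.
    import os
    import torch

    DIR = "./offload"
    os.makedirs(DIR, exist_ok=True)
    DIM, N_BLOCKS, RANK = 16, 4, 2

    # Offload frozen base weights to disk, one file per block.
    for i in range(N_BLOCKS):
        torch.save(torch.randn(DIM, DIM) / DIM**0.5, f"{DIR}/block_{i}.pt")

    # Small trainable LoRA factors stay resident in memory.
    lora = [(torch.zeros(DIM, RANK, requires_grad=True),
             torch.randn(RANK, DIM, requires_grad=True)) for _ in range(N_BLOCKS)]

    def block_forward(i, x):
        W = torch.load(f"{DIR}/block_{i}.pt")  # load only this block's weights
        A, B = lora[i]
        return torch.relu(x @ W + x @ A @ B)   # frozen base + LoRA update

    x = torch.randn(1, DIM)

    # Pass 1 (forward): run block by block, caching each block's input on disk.
    h = x
    with torch.no_grad():
        for i in range(N_BLOCKS):
            torch.save(h, f"{DIR}/input_{i}.pt")
            h = block_forward(i, h)
    grad_out = h  # dLoss/dOutput for the toy loss 0.5 * ||h||^2

    # Pass 2 (backward): walk blocks in reverse; reload the cached input,
    # re-run that block's forward with autograd on, and backprop one block.
    for i in reversed(range(N_BLOCKS)):
        h_in = torch.load(f"{DIR}/input_{i}.pt").requires_grad_(True)
        out = block_forward(i, h_in)
        out.backward(grad_out)   # accumulates grads into lora[i] and h_in
        grad_out = h_in.grad     # gradient flowing to the previous block

    # Only the LoRA factors have gradients; the base weights were never trained.
    print([tuple(p.grad.shape) for A, B in lora for p in (A, B)])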

Quick Start & Requirements

  • Install dependencies: pip install torch sentencepiece numpy
  • Clone the Llama2 repository and download models. Place tokenizer.model in the same directory as model weights.
  • Run python prepare_model.py to convert the downloaded weights into a sequential format that can be loaded one block at a time (see the sketch after this list).
  • Fine-tune with python finetune.py.
  • Requires Python 3.x, PyTorch, SentencePiece, and NumPy. Tested on Apple M1 (16GB RAM) and M2 (24GB RAM).
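
The conversion step above (prepare_model.py) can be pictured as splitting one consolidated checkpoint into per-block files that fine-tuning can then load one at a time. The sketch below is hypothetical: the checkpoint name and key layout follow the standard Llama2 format, not the script's exact behavior.

    # Hypothetical sketch: the file names ("consolidated.00.pth", "sequential_model/")
    # and key layout ("layers.<n>....") are assumptions based on the standard
    # Llama2 checkpoint format, not prepare_model.py's exact behavior.
    import os
    import torch

    ckpt = torch.load("consolidated.00.pth", map_location="cpu")  # must exist locally
    os.makedirs("sequential_model", exist_ok=True)

    # Group parameters by transformer block, e.g. "layers.12.attention.wq.weight".
    blocks = {}
    for name, tensor in ckpt.items():
        key = name.split(".")[1] if name.startswith("layers.") else "misc"
        blocks.setdefault(key, {})[name] = tensor

    # One file per block, so fine-tuning can load a single block at a time.
    for key, params in blocks.items():
        torch.save(params, f"sequential_model/block_{key}.pt")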

Highlighted Details

  • Enables fine-tuning of Llama2-70B and CodeLlama models on devices with limited RAM.
  • Achieves high GPU utilization during backward passes by optimizing data loading.
  • Supports both Apple Silicon (MPS) and NVIDIA GPUs.
  • Includes functionality to merge LoRA weights back into the original model format (see the sketch after this list).
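
Merging LoRA factors back into the base weights is plain linear algebra: W' = W + (alpha / r) * B @ A. Below is a hedged sketch assuming the usual LoRA shapes and scaling, not slowllama's exact merge code.

    # Hedged sketch assuming the usual LoRA shapes and alpha/r scaling;
    # not slowllama's exact merge code.
    import torch

    def merge_lora(W, A, B, alpha=16.0):
        # W: (out, in); A: (r, in); B: (out, r); merged W' = W + (alpha/r) * B @ A
        r = A.shape[0]
        return W + (alpha / r) * (B @ A)

    W = torch.randn(32, 32)          # frozen base weight
    A = torch.randn(4, 32) * 0.01    # LoRA down-projection, rank 4
    B = torch.randn(32, 4) * 0.01    # LoRA up-projection
    W_merged = merge_lora(W, A, B)
    print(W_merged.shape)            # torch.Size([32, 32])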

Maintenance & Community

The project appears to be a personal effort, and the README indicates limited public community engagement. Contact information is provided via a GitHub handle and email.

Licensing & Compatibility

The project's model definition is based on llama2.c, which is MIT licensed. The MIT license on these core components is generally permissive for commercial use and closed-source linking, but users should verify the licenses of the base models they fine-tune.

Limitations & Caveats

The project is experimental, particularly on CUDA. Full (non-LoRA) fine-tuning was removed due to concerns about SSD write endurance. Fine-tuning 70B models can be slow on older hardware, and prefetching and asynchronous saving are noted as potential optimizations.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 10 more.

qlora by artidoro

Finetuning tool for quantized LLMs

created 2 years ago
updated 1 year ago
11k stars

Top 0.2% on sourcepulse