Fine-tuning and inference tool for quantized LLaMA models
This repository provides a method for LoRA fine-tuning of large language models quantized to 4-bit precision, enabling efficient training on consumer hardware. It targets researchers and developers working with LLMs who need to adapt models to specific tasks with limited VRAM.
How It Works
The project modifies existing libraries such as PEFT and GPTQ-for-LLaMA to enable LoRA fine-tuning on models already quantized to 4-bit precision. At inference time it reconstructs FP16 matrices from the 4-bit data and uses torch.matmul, which significantly speeds up inference. The approach supports multiple quantization widths (2, 3, 4, and 8 bits) and includes optimizations such as gradient checkpointing, Flash Attention, and Triton backends for improved performance and reduced memory usage.
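As a rough illustration of that path, the sketch below dequantizes a packed 4-bit weight back to FP16 and performs a standard torch.matmul. The packing layout, group size, and parameter names are assumptions chosen for clarity and do not mirror the project's actual kernels.

```python
import torch

def dequantize_4bit(qweight, scales, zeros, group_size=128):
    """Hypothetical layout: qweight is (in_features // 8, out_features) int32
    with eight 4-bit values packed per element; scales and zeros are
    (num_groups, out_features) FP16 per-group quantization parameters."""
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    # Unpack eight nibbles from each int32 entry: (rows, 8, cols).
    nibbles = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    w = nibbles.reshape(-1, qweight.shape[1]).to(torch.float16)   # (in_features, out_features)
    groups = torch.arange(w.shape[0], device=qweight.device) // group_size
    # Apply per-group zero point and scale to recover the FP16 weight.
    return (w - zeros[groups]) * scales[groups]

def linear_4bit(x, qweight, scales, zeros):
    # Reconstruct the FP16 weight, then rely on a regular dense matmul,
    # which is what makes this path fast on GPU.
    weight = dequantize_4bit(qweight, scales, zeros)
    return torch.matmul(x.to(torch.float16), weight)
```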
Quick Start & Requirements
Install with pip install . after cloning the repository and checking out the winglian-setup_pip branch.
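Once installed, a LoRA-over-quantized-model setup looks roughly like the following. The 4-bit loader name is a placeholder (the repository provides its own loading utilities), while the LoraConfig and get_peft_model calls are PEFT's standard API.

```python
from peft import LoraConfig, get_peft_model

# Placeholder: load the 4-bit quantized LLaMA base model with the repository's
# own loader (the exact function name and arguments are repo-specific).
model = load_model_4bit("llama-7b-4bit.safetensors")  # hypothetical helper

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # the simple injections noted under Limitations
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```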
Highlighted Details
Integrates with text-generation-webui via monkey patching for improved inference performance.
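The mechanism is attribute rebinding: the faster layer implementation is swapped in before the web UI builds its model. The self-contained toy below uses placeholder names (not the project's or text-generation-webui's actual modules) to show the idea.

```python
import types

# Stand-in for the web UI module that holds the default quantized layer.
webui_quant = types.SimpleNamespace()

class SlowQuantLinear:
    def forward(self, x):
        return "slow per-group 4-bit matmul"

class FastQuantLinear(SlowQuantLinear):
    def forward(self, x):
        return "FP16 reconstruction + torch.matmul"

webui_quant.QuantLinear = SlowQuantLinear

# The monkey patch: rebind the attribute before the UI constructs its model,
# so any later lookup of webui_quant.QuantLinear resolves to the faster class.
webui_quant.QuantLinear = FastQuantLinear

print(webui_quant.QuantLinear().forward(None))  # -> "FP16 reconstruction + torch.matmul"
```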
Maintenance & Community
The project has seen contributions from multiple users, indicating active development and community interest. Specific community links (Discord/Slack) are not provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The Docker build is noted as not currently working. The monkey patch for text-generation-webui may break certain web UI features such as model selection, LoRA selection, and training. The quantized attention and fused MLP patches require PyTorch 2.0+ and only support simple LoRA injections (q_proj and v_proj).