alpaca_lora_4bit by johnsmith0031

Fine-tuning and inference tool for quantized LLaMA models

Created 2 years ago
535 stars

Top 59.5% on SourcePulse

View on GitHub
Project Summary

This repository provides a method for LoRA fine-tuning of large language models quantized to 4-bit precision, enabling efficient training on consumer hardware. It targets researchers and developers working with LLMs who need to adapt models to specific tasks with limited VRAM.

How It Works

The project patches existing libraries such as PEFT and GPTQ-for-LLaMA so that LoRA fine-tuning can be applied to models already quantized to 4-bit. At runtime it reconstructs FP16 weight matrices from the packed 4-bit data and uses torch.matmul for significantly faster inference. The approach supports 2-, 3-, 4-, and 8-bit quantization and includes optimizations such as gradient checkpointing, Flash Attention, and a Triton backend for better performance and lower memory usage.
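
A minimal sketch of the dequantize-then-matmul idea described above, assuming a simple packing scheme (two 4-bit values per uint8 byte with per-column scale and zero point). The repository's real kernels follow the GPTQ packing format, so the function names and layout here are illustrative only:

```python
import torch

def dequantize_4bit(qweight, scales, zeros, dtype=torch.float16):
    """Unpack two 4-bit values per uint8 byte and rescale to a dense matrix.

    qweight: (in_features // 2, out_features) uint8, two nibbles per byte
    scales, zeros: (out_features,) per-column quantization parameters
    """
    low = (qweight & 0x0F).to(dtype)
    high = (qweight >> 4).to(dtype)
    w = torch.empty(qweight.shape[0] * 2, qweight.shape[1],
                    dtype=dtype, device=qweight.device)
    w[0::2] = low   # even input rows come from the low nibble
    w[1::2] = high  # odd input rows come from the high nibble
    return (w - zeros.to(dtype)) * scales.to(dtype)

def forward_4bit(x, qweight, scales, zeros):
    # Reconstruct a dense weight matrix on the fly, then use a regular matmul.
    w = dequantize_4bit(qweight, scales, zeros, dtype=x.dtype)
    return torch.matmul(x, w)

# Toy usage: a 16 x 32 weight packed into 8 x 32 bytes.
# float32 here so the demo runs on CPU; on a GPU you would use float16.
qweight = torch.randint(0, 256, (8, 32), dtype=torch.uint8)
scales, zeros = torch.rand(32), torch.full((32,), 8.0)
x = torch.randn(4, 16)
print(forward_4bit(x, qweight, scales, zeros).shape)  # torch.Size([4, 32])
```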

Quick Start & Requirements

  • Install: clone the repo, check out the winglian-setup_pip branch, then run pip install . (a hedged usage sketch follows this list).
  • Prerequisites: Python and PyTorch; a CUDA-capable GPU is strongly recommended for acceptable performance.
  • Docker: Available, but noted as not currently working.
  • Docs: Installation manual available.
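
For orientation, a hedged sketch of the kind of LoRA setup the repo targets (only q_proj and v_proj injected, matching the caveat under Limitations). The PEFT calls are real, but the plain FP16 Transformers load is a stand-in: alpaca_lora_4bit ships its own 4-bit LLaMA loader and fine-tuning script, which should be used instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint path; point at a local or Hub LLaMA checkpoint.
model_name = "path/to/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# "Simple LoRA injection": adapters on q_proj and v_proj only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```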

Highlighted Details

  • Enables 4-bit LoRA fine-tuning on models like Llama and Llama 2.
  • Achieves faster inference (e.g., 20 tokens/sec on a 7B model with optimizations).
  • Supports gradient checkpointing for fine-tuning 30B models on 24GB VRAM (see the sketch after this list).
  • Integrates with text-generation-webui via monkey patching for improved inference performance.
  • Offers Flash Attention 2 and Triton backend support.
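
As a rough illustration of the gradient-checkpointing bullet above, here is the generic PyTorch mechanism (not the repo's exact integration): activations inside the checkpointed block are recomputed during the backward pass rather than stored, trading extra compute for lower VRAM.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A stand-in for one transformer block's MLP.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations recomputed in backward
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```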

Maintenance & Community

The project has received contributions from multiple users, indicating community interest, though recent activity has slowed (see the Health Check section below). Specific community links (Discord/Slack) are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The Docker build is noted as not currently working. The monkey patch for text-generation-webui may break certain web UI features, such as model selection, LoRA selection, and training. The quantized-attention and fused-MLP patches require PyTorch 2.0+ and only support simple LoRA injections (q_proj and v_proj).

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 1 week ago
Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

mlx-lm by ml-explore

26.1% · 2k stars
Python package for LLM text generation and fine-tuning on Apple silicon
Created 6 months ago · Updated 23 hours ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · Updated 1 year ago