LLM-QAT by facebookresearch

Research paper code for data-free quantization-aware training (QAT) of LLMs

created 2 years ago · 305 stars · Top 88.8% on sourcepulse

Project Summary

This repository provides the code for LLM-QAT, a data-free quantization-aware training method for large language models. It targets researchers and engineers who want to reduce LLM memory footprint and improve inference throughput by quantizing weights, activations, and the KV cache, and it achieves substantially higher accuracy at low bit-widths than training-free (post-training) quantization methods.

How It Works

LLM-QAT employs quantization-aware training (QAT) to fine-tune LLMs for lower-precision inference. A key innovation is the simultaneous quantization of weights, activations, and the KV cache; the KV cache dominates memory at long sequence lengths, so quantizing it is crucial for throughput and long-sequence handling. Because an LLM's original training data is rarely available, the method synthesizes its own calibration data from the full-precision model's generations, so QAT requires no large, task-specific dataset.
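
The building block behind QAT is "fake quantization": tensors are snapped to a low-bit grid in the forward pass, while a straight-through estimator (STE) lets gradients flow through the non-differentiable rounding. The sketch below is a minimal PyTorch illustration under an assumed per-tensor symmetric MinMax scale; it is not the repository's exact quantizer.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric MinMax fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, n_bits):
        qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 for 4 bits
        scale = x.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (illustrative)
        return torch.round(x / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # STE: treat rounding as the identity; no gradient for n_bits.
        return grad_out, None

def fake_quant(x, n_bits=4):
    return FakeQuantSTE.apply(x, n_bits)

# During QAT the forward pass uses the quantized tensor, but the optimizer
# still updates the underlying full-precision weight:
w = torch.randn(4096, 4096, requires_grad=True)
y = fake_quant(w, n_bits=4) @ torch.randn(4096, 8)
y.sum().backward()  # gradients reach w through the STE
```

The same pattern applies to activations and the KV cache, typically at different bit-widths and with finer (per-channel or per-token) scaling.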

Quick Start & Requirements

  • Install: pip install -r requirement.txt and install Apex from source.
  • Prerequisites: Python 3.9, PyTorch >= 1.13.
  • Data Synthesis: Requires downloading a LLaMA model from HuggingFace and running generate_data.py (the repo parallelizes this, potentially across 64 GPUs), then merging the shards with merge_gen_data.py; a sketch of the generation loop follows this list.
  • Training: Execute run_train.sh with the weight, activation, and KV-cache bit-widths (e.g., bash run_train.sh 4 8 4 for W4A8KV4).
  • Links: Paper (arXiv:2305.17888), Code (https://github.com/facebookresearch/LLM-QAT)
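
The data-free idea behind generate_data.py is that the full-precision model writes its own fine-tuning corpus: seed it with a start token, sample continuations, and merge the per-GPU shards. Below is a minimal sketch of that loop; the checkpoint id, sampling settings, and shard handling are illustrative assumptions, not the script's actual interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint id; the repo expects a locally downloaded LLaMA.
MODEL = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()

@torch.no_grad()
def synthesize_shard(n_samples, max_new_tokens=512):
    """Generate calibration text from the model itself -- no real dataset."""
    texts = []
    bos = torch.tensor([[tok.bos_token_id]], device=model.device)
    for _ in range(n_samples):
        out = model.generate(bos, do_sample=True, top_p=0.95,
                             max_new_tokens=max_new_tokens)
        texts.append(tok.decode(out[0], skip_special_tokens=True))
    return texts

# Each worker writes its own shard, which a merge step in the style of
# merge_gen_data.py then concatenates into one training file.
shard = synthesize_shard(n_samples=4)
```

On a single GPU the same loop can be run over the shards sequentially, trading wall-clock time for hardware.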

Highlighted Details

  • Achieves up to ~20 points improvement over training-free methods at 4-bit weight, 8-bit activation, and 4-bit KV cache (W4A8KV4) quantization; KV-cache quantization is sketched after this list.
  • Experiments conducted on LLaMA 7B, 13B, and 30B models.
  • Includes zero-shot common sense reasoning accuracy benchmarks for quantized LLaMA-7B models.
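
To make the KV-cache part of that W4A8KV4 setup concrete, keys and values are pushed onto a low-bit grid before entering the attention cache. Below is a minimal sketch under an assumed per-tensor scale and (batch, heads, tokens, head_dim) cache layout; during QAT this is fake quantization, while a deployed kernel would store the integer codes plus scales to realize the roughly 4x memory saving over fp16.

```python
import torch

def quantize_sym(x, n_bits=4):
    # Symmetric MinMax fake quantization (illustrative per-tensor scale).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def append_kv(cache_k, cache_v, k_new, v_new, kv_bits=4):
    """Quantize new key/value tensors before appending them to the cache."""
    k_q = quantize_sym(k_new, kv_bits)   # (batch, heads, new_tokens, head_dim)
    v_q = quantize_sym(v_new, kv_bits)
    return (torch.cat([cache_k, k_q], dim=2),
            torch.cat([cache_v, v_q], dim=2))
```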

Maintenance & Community

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 21 stars in the last 90 days

Licensing & Compatibility

  • License: CC-BY-NC 4.0.
  • The NonCommercial clause bars commercial use of the code and of any derivatives, so the repo is suitable for research but not for commercial deployment.

Limitations & Caveats

The data synthesis step is designed for parallel execution across many GPUs, which may be a barrier for users with limited hardware. The paper's reported results come from an internal LLaMA codebase and were reproduced with the HuggingFace implementation, so minor numerical discrepancies are possible.


Explore Similar Projects

qlora by artidoro

Finetuning tool for quantized LLMs

  • 11k stars · Top 0.2% on sourcepulse · created 2 years ago · updated 1 year ago
  • Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 10 more