Research paper code for data-free quantization-aware training (QAT) of LLMs
This repository provides the code for LLM-QAT, a data-free quantization-aware training method for large language models. It targets researchers and engineers aiming to reduce LLM memory footprint and improve inference speed by quantizing weights, activations, and the KV cache, achieving significant performance gains over training-free methods.
How It Works
LLM-QAT employs quantization-aware training (QAT) to fine-tune LLMs for low-precision inference. A key contribution is the joint quantization of weights, activations, and the KV cache, which is crucial for throughput and long-sequence handling. To remain data-free, the method synthesizes its own calibration data by sampling generations from the pretrained model, so no external task-specific dataset is required.
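To make the idea concrete, the sketch below shows a generic symmetric min-max fake quantizer with a straight-through estimator, the basic building block of QAT. It is illustrative only: the class and function names are assumptions, not the repository's API, and the repository's actual quantizers differ in granularity and detail.

```python
# Illustrative sketch only: a symmetric min-max fake quantizer with a
# straight-through estimator (STE), the basic building block of QAT.
# Names and default bit widths here are assumptions, not the repository's API.
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax   # symmetric min-max scale
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat rounding as identity so gradients still reach the weights.
        return grad_output, None

def fake_quant_forward(weight, activation, kv, w_bits=4, a_bits=8, kv_bits=4):
    """Fake-quantize weights, activations, and KV-cache tensors during training."""
    fq = FakeQuantize.apply
    return fq(weight, w_bits), fq(activation, a_bits), fq(kv, kv_bits)
```

During QAT, such fake-quantized tensors replace the full-precision ones in the forward pass, while the optimizer continues to update the underlying full-precision weights.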
Quick Start & Requirements
1. Install the dependencies with `pip install -r requirement.txt` and install Apex from source.
2. Synthesize calibration data with `generate_data.py` (designed to run in parallel, potentially across 64 GPUs), then combine the outputs with `merge_gen_data.py` (see the sketch after this list).
3. Run `run_train.sh` with the desired bit configuration (e.g., `bash run_train.sh 4 8 4`).
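For orientation, here is a minimal sketch of the data-synthesis idea behind `generate_data.py`: the pretrained model writes its own calibration text via stochastic next-token sampling, so no external dataset is needed. The checkpoint name and sampling settings below are placeholders, not the script's actual arguments; in the repository this generation is sharded across GPUs and the outputs are then combined by `merge_gen_data.py`.

```python
# Minimal sketch of data-free calibration data synthesis: sample text from
# the pretrained model itself. Checkpoint name and sampling settings are
# placeholders, not the repository's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # assumption: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Start from the beginning-of-sequence token and let the model generate its
# own calibration text by stochastic next-token sampling.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", add_special_tokens=False)
sample = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=256)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```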
Limitations & Caveats
The data synthesis step is designed for parallel execution across many GPUs, which may be a barrier for users with limited hardware. The reported results were obtained with an internal LLaMA codebase and have been reproduced with the HuggingFace implementation, so minor discrepancies are possible.