LLM-QAT by facebookresearch

Research paper code for data-free quantization-aware training (QAT) of LLMs

Created 2 years ago
312 stars

Top 86.3% on SourcePulse

View on GitHub
Project Summary

This repository provides the code for LLM-QAT, a data-free quantization-aware training method for large language models. It targets researchers and engineers who want to shrink LLM memory footprint and improve inference speed by quantizing weights, activations, and the KV cache, and it reports substantially better accuracy than training-free post-training quantization at low bit-widths.

How It Works

LLM-QAT uses quantization-aware training (QAT) to fine-tune LLMs for low-precision inference. A key contribution is the joint quantization of weights, activations, and the KV cache; quantizing the KV cache is important for throughput and for handling long sequences. The training data is synthesized by the pretrained model itself, so QAT can proceed without access to the original training set or a large task-specific dataset.
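To make the mechanism concrete, below is a minimal sketch of the symmetric fake-quantization (quantize-then-dequantize) step that QAT inserts into the forward pass. The function name, bit-width argument, and per-tensor scaling are illustrative assumptions, not the repository's actual quantizer implementation.

    import torch

    def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        # Illustrative symmetric, per-tensor fake quantization; the repo's
        # quantizer classes may differ in granularity and clipping strategy.
        qmax = 2 ** (num_bits - 1) - 1                 # e.g. 7 for signed 4-bit
        scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        x_deq = q * scale                              # back to float precision
        # Straight-through estimator: the forward pass sees quantized values,
        # while gradients flow as if quantization were the identity.
        return x + (x_deq - x).detach()

    # During QAT such a function is applied to weights, activations, and the
    # cached keys/values inside the model's forward pass, for example
    # (hypothetical usage):
    #   w_q = fake_quantize(layer.weight, num_bits=4)
    #   out = torch.nn.functional.linear(fake_quantize(h, num_bits=8), w_q)

This is only a sketch of the general QAT technique; the actual quantizers control granularity and clipping choices that this example omits.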

Quick Start & Requirements

  • Install: pip install -r requirement.txt and install Apex from source.
  • Prerequisites: Python 3.9, PyTorch >= 1.13.
  • Data Synthesis: Requires downloading a LLaMA model from HuggingFace and running generate_data.py (potentially across 64 GPUs) followed by merge_gen_data.py.
  • Training: Execute run_train.sh with the desired bit configuration (e.g., bash run_train.sh 4 8 4 for the 4-bit weight, 8-bit activation, 4-bit KV cache setting highlighted below).
  • Links: Paper, Code

Highlighted Details

  • Achieves up to ~20 points improvement over training-free methods for 4-bit weight, 8-bit activation, and 4-bit KV cache quantization.
  • Experiments conducted on LLaMA 7B, 13B, and 30B models.
  • Includes zero-shot common sense reasoning accuracy benchmarks for quantized LLaMA-7B models.

Maintenance & Community

Licensing & Compatibility

  • License: CC-BY-NC 4.0.
  • The NonCommercial clause prohibits commercial use of the code, including commercial use of derivative works.

Limitations & Caveats

The data synthesis step is designed for parallel execution across many GPUs, which may be a barrier for users with limited hardware. The reported results were produced with an internal LLaMA codebase and then reproduced with the HuggingFace implementation, so minor numerical discrepancies are possible.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Zack Li (Cofounder of Nexa AI), and 4 more.

smoothquant by mit-han-lab · 0.3% · 2k stars
Post-training quantization research paper for large language models
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab · 0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200 · 0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · Updated 1 year ago