Research paper code for data-free quantization-aware training (QAT) of LLMs
This repository provides the code for LLM-QAT, a data-free quantization-aware training method for large language models. It targets researchers and engineers aiming to reduce LLM memory footprint and improve inference speed by quantizing weights, activations, and the KV cache, achieving significant performance gains over training-free methods.
How It Works
LLM-QAT employs quantization-aware training (QAT) to fine-tune LLMs for low-precision inference. A key contribution is the joint quantization of weights, activations, and the KV cache, which is crucial for throughput and long-sequence handling. To remain data-free, the method synthesizes its own calibration data by sampling generations from the pretrained model, so no external task-specific dataset is required.
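To make the idea concrete, the sketch below shows a generic symmetric min-max fake quantizer with a straight-through estimator, the basic building block of QAT. It is illustrative only: the class and function names are assumptions, not the repository's API, and the repository's actual quantizers differ in granularity and detail.

```python
# Illustrative sketch only: a symmetric min-max fake quantizer with a
# straight-through estimator (STE), the basic building block of QAT.
# Names and default bit widths here are assumptions, not the repository's API.
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax   # symmetric min-max scale
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat rounding as identity so gradients still reach the weights.
        return grad_output, None

def fake_quant_forward(weight, activation, kv, w_bits=4, a_bits=8, kv_bits=4):
    """Fake-quantize weights, activations, and KV-cache tensors during training."""
    fq = FakeQuantize.apply
    return fq(weight, w_bits), fq(activation, a_bits), fq(kv, kv_bits)
```

During QAT, such fake-quantized tensors replace the full-precision ones in the forward pass, while the optimizer continues to update the underlying full-precision weights.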
Quick Start & Requirements
1. Install the dependencies with `pip install -r requirement.txt` and install Apex from source.
2. Synthesize calibration data with `generate_data.py` (designed to run in parallel, potentially across 64 GPUs), then combine the outputs with `merge_gen_data.py` (see the sketch after this list).
3. Run `run_train.sh` with the desired bit configuration (e.g., `bash run_train.sh 4 8 4`).
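For orientation, here is a minimal sketch of the data-synthesis idea behind `generate_data.py`: the pretrained model writes its own calibration text via stochastic next-token sampling, so no external dataset is needed. The checkpoint name and sampling settings below are placeholders, not the script's actual arguments; in the repository this generation is sharded across GPUs and the outputs are then combined by `merge_gen_data.py`.

```python
# Minimal sketch of data-free calibration data synthesis: sample text from
# the pretrained model itself. Checkpoint name and sampling settings are
# placeholders, not the repository's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # assumption: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Start from the beginning-of-sequence token and let the model generate its
# own calibration text by stochastic next-token sampling.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", add_special_tokens=False)
sample = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=256)
print(tokenizer.decode(sample[0], skip_special_tokens=True))
```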
Limitations & Caveats
The data synthesis step is designed for parallel execution across many GPUs, which may be a barrier for users with limited hardware. The reported results were obtained with an internal LLaMA codebase and have been reproduced with the HuggingFace implementation, so minor discrepancies are possible.