PyTorch implementation for efficient quantization-aware training of LLMs
This repository provides the official PyTorch implementation for EfficientQAT, a method for efficient quantization-aware training of large language models (LLMs). It targets researchers and engineers seeking to reduce LLM memory footprint and inference costs while minimizing accuracy degradation, offering pre-quantized models and tools for training and conversion.
How It Works
EfficientQAT employs a two-phase training strategy: Block-wise training of all parameters (Block-AP) followed by end-to-end training of quantization parameters (E2E-QP). This approach aims to push the limits of uniform quantization by efficiently optimizing quantization parameters, enabling significant model compression with minimal performance loss.
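The sketch below illustrates the two phases with a plain PyTorch fake-quantized linear layer. It is a minimal, illustrative reconstruction of the idea described above, not the repository's actual code: the `QuantLinear` class, per-channel initialization, and all hyperparameters are assumptions (EfficientQAT itself uses group-wise uniform quantization).

```python
# Illustrative sketch of Block-AP and E2E-QP; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def round_ste(x):
    """Round with a straight-through estimator so gradients flow to scale/zero-point."""
    return x + (torch.round(x) - x).detach()


class QuantLinear(nn.Module):
    """Linear layer with uniform fake-quantized weights; scale and zero-point are learnable."""
    def __init__(self, linear: nn.Linear, bits: int = 2):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = None if linear.bias is None else nn.Parameter(linear.bias.detach().clone())
        self.qmax = 2 ** bits - 1
        # Per-output-channel step size and zero point (simplified; the paper quantizes group-wise).
        self.scale = nn.Parameter(self.weight.abs().max(dim=1, keepdim=True).values * 2 / self.qmax)
        self.zero_point = nn.Parameter(torch.full_like(self.scale, self.qmax / 2))

    def forward(self, x):
        q = torch.clamp(round_ste(self.weight / self.scale + self.zero_point), 0, self.qmax)
        w = (q - self.zero_point) * self.scale  # dequantized weights used in the forward pass
        return F.linear(x, w, self.bias)


def block_ap(quant_block, fp_block, calib_acts, steps=100, lr=1e-4):
    """Phase 1 (Block-AP): train *all* parameters of one quantized block to reproduce
    the full-precision block's outputs on cached calibration activations."""
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_acts:
            with torch.no_grad():
                target = fp_block(x)
            loss = F.mse_loss(quant_block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()


def e2e_qp(model, dataloader, steps=1000, lr=2e-5):
    """Phase 2 (E2E-QP): freeze the quantized weights and train only the quantization
    parameters (scales / zero points) end to end with the usual language-modeling loss."""
    for name, p in model.named_parameters():
        p.requires_grad = ("scale" in name) or ("zero_point" in name)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _, batch in zip(range(steps), dataloader):
        loss = model(**batch).loss  # assumes an HF-style causal LM that returns .loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```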
Quick Start & Requirements
Set up the environment with `conda create -n efficientqat python==3.11`, then `conda activate efficientqat`, and install dependencies with `pip install -r requirements.txt`.
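As a quick sanity check after installation, a GPTQ-format pre-quantized checkpoint can be loaded through the Transformers GPTQ integration. This is a minimal sketch: the repository ID below is a placeholder (substitute a checkpoint actually published by the project), and a GPTQ backend such as the GPTQModel fork mentioned under Limitations must be installed.

```python
# Minimal sketch for loading a GPTQ-format EfficientQAT checkpoint with Transformers.
# The model ID is a placeholder, not a real repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path-or-hub-id-of-an-EfficientQAT-GPTQ-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization-aware training lets us", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```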
Highlighted Details
Maintenance & Community
The project is associated with OpenGVLab and has seen recent updates in August and October 2024, including support for Mistral-Large-Instruct and the PrefixQuant algorithm.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The README notes potential issues with AutoGPTQ for asymmetric quantization and recommends using the GPTQModel fork instead. Speedup issues with BitBLAS conversion are also mentioned.