EfficientQAT by OpenGVLab

PyTorch implementation for efficient quantization-aware training of LLMs

Created 1 year ago
302 stars

Top 88.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides the official PyTorch implementation for EfficientQAT, a method for efficient quantization-aware training of large language models (LLMs). It targets researchers and engineers seeking to reduce LLM memory footprint and inference costs while minimizing accuracy degradation, offering pre-quantized models and tools for training and conversion.

How It Works

EfficientQAT employs a two-phase training strategy: Block-wise training of all parameters (Block-AP) followed by end-to-end training of quantization parameters (E2E-QP). This approach aims to push the limits of uniform quantization by efficiently optimizing quantization parameters, enabling significant model compression with minimal performance loss.
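
The sketch below illustrates the two-phase idea only and is not the repository's training code: QuantLinear, fake_quant, block_ap, and e2e_qp are hypothetical names, the quantizer is a plain uniform fake-quantizer with a straight-through estimator, and group-wise scales are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(w, scale, zero, n_bits=4):
    # Uniform affine fake-quantization with a straight-through estimator on the
    # rounding op, so gradients reach the weights as well as scale/zero.
    qmax = 2 ** n_bits - 1
    x = w / scale + zero
    x = x + (torch.round(x) - x).detach()  # STE: round in forward, identity in backward
    q = torch.clamp(x, 0, qmax)
    return (q - zero) * scale

class QuantLinear(nn.Module):
    # Hypothetical weight-quantized linear layer (per-row scales; group-wise scales omitted).
    def __init__(self, linear, n_bits=4):
        super().__init__()
        self.n_bits = n_bits
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.scale = nn.Parameter(self.weight.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1)))
        self.zero = nn.Parameter(torch.full_like(self.scale, float(2 ** (n_bits - 1))))

    def forward(self, x):
        return F.linear(x, fake_quant(self.weight, self.scale, self.zero, self.n_bits))

def block_ap(quant_block, fp_block, calib_inputs, steps=100, lr=1e-4):
    # Phase 1 (Block-AP): train *all* parameters of one transformer block at a time
    # to reproduce the full-precision block's outputs on calibration activations.
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            loss = F.mse_loss(quant_block(x), fp_block(x).detach())
            opt.zero_grad()
            loss.backward()
            opt.step()

def e2e_qp(model, batches, steps=100, lr=1e-5):
    # Phase 2 (E2E-QP): freeze the quantized weights and train only the quantization
    # parameters (scales/zeros) end to end on the causal-LM loss (labels in the batch).
    for p in model.parameters():
        p.requires_grad_(False)
    q_params = [p for n, p in model.named_parameters() if n.endswith(("scale", "zero"))]
    for p in q_params:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(q_params, lr=lr)
    for _, batch in zip(range(steps), batches):
        loss = model(**batch).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```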

Quick Start & Requirements

  • Installation: Clone the repository, create the environment with conda create -n efficientqat python==3.11, activate it with conda activate efficientqat, and install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.11, PyTorch. GPU is highly recommended for training.
  • Resources: Training requires significant GPU memory, depending on model size and quantization level. Pre-quantized models are available on Hugging Face (a loading sketch follows this list).
  • Docs: Model Zoo, Training, Inference
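
As a minimal sketch of using a pre-quantized checkpoint, the snippet below assumes the model has been converted to GPTQ format and that a GPTQ backend (e.g. the GPTQModel package) is installed; the repository id is a placeholder, not a verified model name from the Model Zoo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute an actual EfficientQAT checkpoint from the Model Zoo.
model_id = "OpenGVLab/EfficientQAT-Llama-2-7B-w4g128"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Quantization-aware training makes it possible to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```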

Highlighted Details

  • Supports quantization for Llama-2, Llama-3, and Mistral-Large-Instruct models.
  • Achieves significant compression, e.g., Llama-2-70B at w4g128 shrinks to 35.8 GB with minimal accuracy loss (see the back-of-envelope estimate after this list).
  • Enables transfer of EfficientQAT models to GPTQ v2 and BitBLAS formats for compatibility with existing inference engines.
  • Introduces PrefixQuant, a new weight-activation quantization algorithm that surpasses dynamic quantization performance.
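
The 35.8 GB figure for Llama-2-70B at w4g128 is consistent with a simple bits-per-weight estimate. The numbers below are assumptions (roughly 69B quantized parameters, one fp16 scale and one 4-bit zero point per group of 128 weights, other tensors ignored), so treat it as a back-of-envelope check rather than the project's own accounting.

```python
# Rough size estimate for Llama-2-70B at w4g128 (4-bit weights, group size 128).
params = 69e9          # assumed number of quantized weights
group_size = 128

weight_bytes = params * 4 / 8                 # 4 bits per weight
overhead_bytes = (params / group_size) * 2.5  # fp16 scale (2 B) + 4-bit zero point (0.5 B) per group

print(f"~{(weight_bytes + overhead_bytes) / 1e9:.1f} GB")  # ~35.8 GB, in line with the reported size
```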

Maintenance & Community

The project is associated with OpenGVLab and has seen recent updates in August and October 2024, including support for Mistral-Large-Instruct and the PrefixQuant algorithm.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes potential issues with AutoGPTQ for asymmetric quantization, recommending the use of the GPTQModel fork. Speedup issues with BitBLAS conversion are also mentioned.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Zack Li (Cofounder of Nexa AI), and 4 more.

smoothquant by mit-han-lab

0.3% · 2k stars
Post-training quantization research paper for large language models
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago