EfficientQAT by OpenGVLab

PyTorch implementation for efficient quantization-aware training of LLMs

Created 1 year ago
302 stars

Top 88.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides the official PyTorch implementation for EfficientQAT, a method for efficient quantization-aware training of large language models (LLMs). It targets researchers and engineers seeking to reduce LLM memory footprint and inference costs while minimizing accuracy degradation, offering pre-quantized models and tools for training and conversion.

How It Works

EfficientQAT employs a two-phase training strategy: Block-wise training of all parameters (Block-AP) followed by end-to-end training of quantization parameters (E2E-QP). This approach aims to push the limits of uniform quantization by efficiently optimizing quantization parameters, enabling significant model compression with minimal performance loss.
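
The sketch below illustrates the two-phase idea only and is not the repository's training code: QuantLinear, fake_quant, block_ap, and e2e_qp are hypothetical names, the quantizer is a plain uniform fake-quantizer with a straight-through estimator, and group-wise scales are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(w, scale, zero, n_bits=4):
    # Uniform affine fake-quantization with a straight-through estimator on the
    # rounding op, so gradients reach the weights as well as scale/zero.
    qmax = 2 ** n_bits - 1
    x = w / scale + zero
    x = x + (torch.round(x) - x).detach()  # STE: round in forward, identity in backward
    q = torch.clamp(x, 0, qmax)
    return (q - zero) * scale

class QuantLinear(nn.Module):
    # Hypothetical weight-quantized linear layer (per-row scales; group-wise scales omitted).
    def __init__(self, linear, n_bits=4):
        super().__init__()
        self.n_bits = n_bits
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.scale = nn.Parameter(self.weight.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1)))
        self.zero = nn.Parameter(torch.full_like(self.scale, float(2 ** (n_bits - 1))))

    def forward(self, x):
        return F.linear(x, fake_quant(self.weight, self.scale, self.zero, self.n_bits))

def block_ap(quant_block, fp_block, calib_inputs, steps=100, lr=1e-4):
    # Phase 1 (Block-AP): train *all* parameters of one transformer block at a time
    # to reproduce the full-precision block's outputs on calibration activations.
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            loss = F.mse_loss(quant_block(x), fp_block(x).detach())
            opt.zero_grad()
            loss.backward()
            opt.step()

def e2e_qp(model, batches, steps=100, lr=1e-5):
    # Phase 2 (E2E-QP): freeze the quantized weights and train only the quantization
    # parameters (scales/zeros) end to end on the causal-LM loss (labels in the batch).
    for p in model.parameters():
        p.requires_grad_(False)
    q_params = [p for n, p in model.named_parameters() if n.endswith(("scale", "zero"))]
    for p in q_params:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(q_params, lr=lr)
    for _, batch in zip(range(steps), batches):
        loss = model(**batch).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```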

Quick Start & Requirements

  • Installation: Clone the repository, create the environment with conda create -n efficientqat python==3.11, activate it with conda activate efficientqat, and install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.11, PyTorch. GPU is highly recommended for training.
  • Resources: Training requires significant GPU memory, depending on model size and quantization level. Pre-quantized models are available on Hugging Face (a loading sketch follows this list).
  • Docs: Model Zoo, Training, Inference
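
As a minimal sketch of using a pre-quantized checkpoint, the snippet below assumes the model has been converted to GPTQ format and that a GPTQ backend (e.g. the GPTQModel package) is installed; the repository id is a placeholder, not a verified model name from the Model Zoo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute an actual EfficientQAT checkpoint from the Model Zoo.
model_id = "OpenGVLab/EfficientQAT-Llama-2-7B-w4g128"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Quantization-aware training makes it possible to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```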

Highlighted Details

  • Supports quantization for Llama-2, Llama-3, and Mistral-Large-Instruct models.
  • Achieves significant compression, e.g., Llama-2-70B at w4g128 shrinks to 35.8 GB with minimal accuracy loss (see the back-of-envelope estimate after this list).
  • Enables transfer of EfficientQAT models to GPTQ v2 and BitBLAS formats for compatibility with existing inference engines.
  • Introduces PrefixQuant, a new weight-activation quantization algorithm that surpasses dynamic quantization performance.
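
The 35.8 GB figure for Llama-2-70B at w4g128 is consistent with a simple bits-per-weight estimate. The numbers below are assumptions (roughly 69B quantized parameters, one fp16 scale and one 4-bit zero point per group of 128 weights, other tensors ignored), so treat it as a back-of-envelope check rather than the project's own accounting.

```python
# Rough size estimate for Llama-2-70B at w4g128 (4-bit weights, group size 128).
params = 69e9          # assumed number of quantized weights
group_size = 128

weight_bytes = params * 4 / 8                 # 4 bits per weight
overhead_bytes = (params / group_size) * 2.5  # fp16 scale (2 B) + 4-bit zero point (0.5 B) per group

print(f"~{(weight_bytes + overhead_bytes) / 1e9:.1f} GB")  # ~35.8 GB, in line with the reported size
```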

Maintenance & Community

The project is associated with OpenGVLab and has seen recent updates in August and October 2024, including support for Mistral-Large-Instruct and the PrefixQuant algorithm.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes potential issues with AutoGPTQ for asymmetric quantization, recommending the use of the GPTQModel fork. Speedup issues with BitBLAS conversion are also mentioned.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Zack Li (Cofounder of Nexa AI), and 4 more.

smoothquant by mit-han-lab

0.3% · 2k stars
Post-training quantization research paper for large language models
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago · Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago