EfficientQAT by OpenGVLab

PyTorch implementation for efficient quantization-aware training of LLMs

created 1 year ago · 287 stars · Top 92.3% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides the official PyTorch implementation for EfficientQAT, a method for efficient quantization-aware training of large language models (LLMs). It targets researchers and engineers seeking to reduce LLM memory footprint and inference costs while minimizing accuracy degradation, offering pre-quantized models and tools for training and conversion.

How It Works

EfficientQAT employs a two-phase training strategy: block-wise training of all parameters (Block-AP), which keeps memory requirements manageable by optimizing one transformer block at a time, followed by end-to-end training of only the quantization parameters (E2E-QP) while the quantized weights stay fixed. This approach aims to push the limits of uniform (INT) quantization, enabling significant model compression with minimal performance loss.
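The sketch below illustrates the two phases on a toy fake-quantized linear layer; the layer, function names, and hyperparameters are illustrative stand-ins, not the repository's actual Block-AP and E2E-QP implementations.

```python
# Conceptual sketch of the two phases on a toy fake-quantized linear layer.
# All module and function names here are illustrative, NOT the repository's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ste_round(x):
    # Straight-through estimator: round in the forward pass, identity gradient backward.
    return x + (torch.round(x) - x).detach()

class FakeQuantLinear(nn.Module):
    """Linear layer with uniform weight quantization and a learnable per-channel scale."""
    def __init__(self, in_features, out_features, n_bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.scale = nn.Parameter(self.weight.detach().abs().mean(dim=1, keepdim=True))
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        w_int = torch.clamp(ste_round(self.weight / self.scale), -self.qmax - 1, self.qmax)
        return F.linear(x, w_int * self.scale)  # use the dequantized weight

def block_ap(quant_block, fp_block, calib_batches, steps=100, lr=1e-4):
    """Phase 1 (Block-AP): train ALL parameters of one block (weights and scales)
    to reproduce the full-precision block's outputs on calibration data."""
    opt = torch.optim.AdamW(quant_block.parameters(), lr=lr)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        loss = F.mse_loss(quant_block(x), fp_block(x).detach())
        opt.zero_grad()
        loss.backward()
        opt.step()

def e2e_qp(quant_model, data_iter, steps=100, lr=2e-5):
    """Phase 2 (E2E-QP): freeze the weights and fine-tune only the quantization
    parameters (scales) end-to-end on the target loss."""
    for name, param in quant_model.named_parameters():
        param.requires_grad_("scale" in name)
    scales = [p for n, p in quant_model.named_parameters() if "scale" in n]
    opt = torch.optim.AdamW(scales, lr=lr)
    for _ in range(steps):
        x, y = next(data_iter)
        loss = F.mse_loss(quant_model(x), y)  # stand-in for the language-modeling loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Block-AP only needs gradients for one block at a time, which keeps memory requirements far below full-model training; E2E-QP then updates only the small set of scale parameters, so the end-to-end pass stays cheap.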

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n efficientqat python==3.11, conda activate efficientqat), and install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.11, PyTorch. GPU is highly recommended for training.
  • Resources: Training requires significant GPU memory, depending on model size and quantization level. Pre-quantized models are available on Hugging Face; a loading sketch follows this list.
  • Docs: Model Zoo, Training, Inference
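
For inference, the pre-quantized checkpoints published in GPTQ format can typically be loaded like any other GPTQ model. Below is a minimal sketch, assuming transformers with a GPTQ backend (e.g., the GPTQModel fork recommended in the README) is installed; the model id is illustrative, not an official checkpoint name.

```python
# Minimal loading sketch; the model id is a placeholder, and a GPTQ backend
# (e.g., GPTQModel/optimum) must be installed for transformers to run the quantized weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Llama-2-7b-EfficientQAT-w4g128-GPTQ"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place quantized weights on available GPUs
    torch_dtype=torch.float16,  # keep activations in fp16
)

inputs = tokenizer("Quantization-aware training is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```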

Highlighted Details

  • Supports quantization for Llama-2, Llama-3, and Mistral-Large-Instruct models.
  • Achieves significant compression, e.g., Llama-2-70B at w4g128 shrinks to 35.8 GB with minimal accuracy loss (a back-of-envelope size estimate follows this list).
  • Enables transfer of EfficientQAT models to GPTQ v2 and BitBLAS formats for compatibility with existing inference engines.
  • Introduces PrefixQuant, a new weight-activation quantization algorithm whose static quantization outperforms dynamic quantization.
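
As a sanity check on the quoted size, here is a back-of-envelope estimate for the w4g128 setting (4-bit weights, group size 128); the ~69B parameter count and the per-group scale/zero-point overhead are approximations, and exact packed sizes depend on the checkpoint format.

```python
# Rough size estimate for Llama-2-70B at w4g128; all figures are approximate.
params = 69e9          # Llama-2-70B has roughly 69B parameters
bits = 4               # w4: 4-bit weights
group_size = 128       # g128: one scale/zero-point per 128 weights

weight_bytes = params * bits / 8                 # packed 4-bit weights
groups = params / group_size
overhead_bytes = groups * (2 + 0.5)              # ~fp16 scale + 4-bit zero point per group

print(f"weights:  {weight_bytes / 1e9:.1f} GB")                     # ~34.5 GB
print(f"overhead: {overhead_bytes / 1e9:.1f} GB")                   # ~1.3 GB
print(f"total:    {(weight_bytes + overhead_bytes) / 1e9:.1f} GB")  # ~35.8 GB
```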

Maintenance & Community

The project is associated with OpenGVLab and has seen recent updates in August and October 2024, including support for Mistral-Large-Instruct and the PrefixQuant algorithm.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes potential issues with AutoGPTQ for asymmetric quantization, recommending the use of the GPTQModel fork. Speedup issues with BitBLAS conversion are also mentioned.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 90 days

Explore Similar Projects

AQLM by Vahe1994
  PyTorch code for LLM compression via Additive Quantization (AQLM)
  0.1% · 1k stars · created 1 year ago · updated 2 months ago
  Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

GPTQ-for-LLaMa by qwopqwop200
  4-bit quantization for LLaMA models using GPTQ
  0.0% · 3k stars · created 2 years ago · updated 1 year ago
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

AutoGPTQ by AutoGPTQ
  LLM quantization package using GPTQ algorithm
  0.1% · 5k stars · created 2 years ago · updated 3 months ago
  Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.