AutoAWQ is a tool for 4-bit quantized LLM inference
AutoAWQ provides an easy-to-use package for 4-bit quantization of Large Language Models (LLMs), significantly reducing memory requirements and accelerating inference. It implements the Activation-aware Weight Quantization (AWQ) algorithm, offering substantial speedups and memory savings compared to FP16 models, making LLMs more accessible for researchers and developers.
How It Works
AutoAWQ implements the AWQ algorithm, which quantizes LLM weights to INT4 while preserving performance by considering activation properties. It offers two quantization modes: GEMV (faster for batch size 1, less suitable for large contexts) and GEMM (faster than FP16 for batch sizes below 8, better for large contexts). The project also leverages fused modules, combining multiple layers into single operations for increased efficiency, often utilizing FasterTransformer kernels for Linux environments.
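For illustration, here is a minimal sketch of the typical quantization flow, assuming AutoAWQ's documented API; the paths are placeholders and exact signatures may vary across versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/fp16-model"   # placeholder: any Hugging Face causal LM
quant_path = "path/to/awq-output"   # placeholder: where the 4-bit weights are written

# "version" picks the kernel: "GEMM" (faster than FP16 below batch size 8, better
# for large contexts) or "GEMV" (fastest at batch size 1, short contexts).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate activations and quantize the weights to INT4.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```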
Quick Start & Requirements
Install with pip install autoawq, or pip install autoawq[kernels] for pre-built kernels.
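Once installed, loading an already-quantized checkpoint and generating looks roughly like the sketch below; the checkpoint path is a placeholder, and the call pattern follows AutoAWQ's examples, which may differ slightly between releases.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "path/to/awq-model"  # placeholder: an existing AWQ-quantized checkpoint

model = AutoAWQForCausalLM.from_quantized(quant_path)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Tokenize a prompt, run 4-bit inference on GPU, and decode the result.
tokens = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```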
Highlighted Details
Integration with Hugging Face transformers.
Maintenance & Community
The project has been deprecated and archived, as its functionality has been adopted by vLLM's LLM Compressor and by MLX-LM. Intel has been a notable community contributor, providing CPU optimizations.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require license clarification.
Limitations & Caveats
The project is deprecated. Fused modules require fuse_layers=True, which restricts sequence length changes after model creation and is incompatible with Windows. FasterTransformer kernels are Linux-only.
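As a small sketch of the fused-module caveat, assuming AutoAWQ's from_quantized loader (the path is a placeholder):

```python
from awq import AutoAWQForCausalLM

# fuse_layers=True fuses multiple layers into single operations (FasterTransformer-
# backed fusion is Linux-only). The fused cache is sized when the model is created,
# so the sequence length cannot be changed afterwards.
model = AutoAWQForCausalLM.from_quantized(
    "path/to/awq-model",  # placeholder: an existing AWQ-quantized checkpoint
    fuse_layers=True,
)
```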