OmniQuant is a quantization technique for Large Language Models (LLMs) that enables significant model compression with minimal performance degradation. It targets researchers and developers who want to deploy LLMs in resource-constrained environments, offering a range of weight-only and weight-activation quantization schemes.
How It Works
OmniQuant performs omnidirectional calibration: rather than hand-crafting quantization parameters, it learns them through block-wise error minimization while keeping the original full-precision weights frozen. Two learnable components drive this process. Learnable Weight Clipping (LWC) optimizes the clipping thresholds that define each weight's quantization range, and Learnable Equivalent Transformation (LET) shifts the quantization difficulty from activations to weights via learnable per-channel scaling and shifting. Because only a small set of parameters is trained, calibration stays efficient, and the method reaches state-of-the-art accuracy in both weight-only (e.g., W3A16) and weight-activation (e.g., W4A4) settings, outperforming prior techniques. A conceptual sketch of LWC and LET follows.
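The snippet below is a minimal, conceptual sketch of the two learnable components, not the repository's implementation; the function names, parameter names (gamma, beta, s, d), and shapes are illustrative assumptions. In OmniQuant these parameters are trained block by block to minimize each transformer block's quantization error while the original weights stay frozen.

```python
# Conceptual sketch only (assumed names and shapes); the repository's actual
# quantizer code differs in detail.
import torch

def lwc_fake_quant(W: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                   n_bits: int = 4) -> torch.Tensor:
    """Learnable Weight Clipping (LWC): uniform asymmetric fake-quantization
    whose clipping range is controlled by learnable per-channel logits."""
    # sigmoid(.) in (0, 1) shrinks the per-output-channel max/min used as the range
    w_max = torch.sigmoid(gamma).reshape(-1, 1) * W.amax(dim=1, keepdim=True)
    w_min = torch.sigmoid(beta).reshape(-1, 1) * W.amin(dim=1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    # Round-to-nearest; training pairs this with a straight-through estimator
    # so gradients reach gamma and beta.
    q = torch.clamp(torch.round(W / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized ("fake-quantized") weights

def let_transform(X: torch.Tensor, W: torch.Tensor,
                  s: torch.Tensor, d: torch.Tensor):
    """Learnable Equivalent Transformation (LET): per-channel scale s and shift d
    move quantization difficulty from activations to weights while keeping the
    linear layer's full-precision output unchanged."""
    X_t = (X - d) / s        # activations with a tamer, easier-to-quantize range
    W_t = W * s              # scales folded into the weight columns
    bias_t = d @ W.T         # shift folded into the (new) bias term
    return X_t, W_t, bias_t  # X_t @ W_t.T + bias_t == X @ W.T
```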
Quick Start & Requirements
- Install: Clone the repository and install it in editable mode:
git clone https://github.com/OpenGVLab/OmniQuant.git && cd OmniQuant
pip install -e .
- Prerequisites: Python 3.10, CUDA, and a bug-fixed version of AutoGPTQ (provided in the repo).
- Setup: Requires downloading pre-trained OmniQuant parameters from Huggingface and, if you quantize models yourself, calibration data (see the download sketch after this list).
- Docs: Usage, MLC-LLM Integration
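A minimal sketch of fetching pre-trained OmniQuant parameters from the Hugging Face Hub, using the standard huggingface_hub API; the repo_id below is a placeholder, so substitute the identifier listed in the project's model zoo.

```python
# Illustrative download of a pre-trained OmniQuant checkpoint; repo_id is a
# placeholder, not a real model-zoo identifier.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="<omniquant-model-zoo-repo>",   # placeholder, see the project README
    local_dir="./omniquant_checkpoints",    # where to put the files
)
print(f"OmniQuant checkpoints downloaded to: {ckpt_dir}")
```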
Highlighted Details
- Supports various quantization levels: W4A16, W3A16, W2A16, W6A6, W4A4.
- Offers a pre-trained OmniQuant model zoo covering the LLaMA, OPT, Falcon, and Mixtral families.
- Enables deployment on diverse hardware via MLC-LLM, including mobile phones.
- Achieves near-lossless 4-bit quantization for Mixtral-8x7B, reducing memory from 87GB to 23GB.
- Compresses Falcon-180B from 335GB to 65GB, enabling inference on a single A100 80GB GPU (a rough memory estimate for these figures follows this list).
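As a sanity check on the memory figures above, here is a back-of-the-envelope estimate covering weights only; it interprets the quoted "GB" as GiB, and the ~47B total parameter count for Mixtral-8x7B and the 3-bit setting for Falcon-180B are assumptions for illustration.

```python
def approx_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint: parameter count x bits per weight,
    ignoring activations, KV cache, and quantization metadata (scales/zeros)."""
    return n_params * bits_per_weight / 8 / 2**30

# Mixtral-8x7B: roughly 47e9 total parameters (assumption for illustration)
print(approx_weight_memory_gib(47e9, 16))   # ~87.5  -> matches the ~87GB FP16 figure
print(approx_weight_memory_gib(47e9, 4))    # ~21.9  -> ~23GB once scales/zeros are added
# Falcon-180B
print(approx_weight_memory_gib(180e9, 16))  # ~335.3 -> matches the 335GB figure
print(approx_weight_memory_gib(180e9, 3))   # ~62.9  -> ~65GB with quantization metadata
```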
Maintenance & Community
- Actively developed, with recent updates pointing to follow-up work such as PrefixQuant and EfficientQAT.
- Paper accepted as a Spotlight presentation at ICLR 2024.
- Related projects include SmoothQuant, AWQ, GPTQ, and RPTQ.
Licensing & Compatibility
- License: MIT License.
- Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
- "Real quantization" for weight-only quantization may lead to slower inference speeds due to AutoGPTQ kernel limitations.
- 2-bit quantization (W2A16) shows noticeably degraded accuracy and is included mainly as an exploration of extreme compression for deployment.