OmniQuant is a quantization technique for Large Language Models (LLMs) that enables significant model compression with minimal performance degradation. It targets researchers and developers who want to deploy LLMs in resource-constrained environments, offering a range of weight-only and weight-activation quantization schemes.
How It Works
OmniQuant performs omnidirectional calibration: rather than hand-crafting quantization parameters, it learns them through block-wise error minimization while keeping the original full-precision weights frozen. Two learnable components drive this process. Learnable Weight Clipping (LWC) optimizes the clipping thresholds that define each weight's quantization range, and Learnable Equivalent Transformation (LET) shifts the quantization difficulty from activations to weights via learnable per-channel scaling and shifting. Because only a small set of parameters is trained, calibration stays efficient, and the method reaches state-of-the-art accuracy in both weight-only (e.g., W3A16) and weight-activation (e.g., W4A4) settings, outperforming prior techniques. A conceptual sketch of LWC and LET follows.
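The snippet below is a minimal, conceptual sketch of the two learnable components, not the repository's implementation; the function names, parameter names (gamma, beta, s, d), and shapes are illustrative assumptions. In OmniQuant these parameters are trained block by block to minimize each transformer block's quantization error while the original weights stay frozen.

```python
# Conceptual sketch only (assumed names and shapes); the repository's actual
# quantizer code differs in detail.
import torch

def lwc_fake_quant(W: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                   n_bits: int = 4) -> torch.Tensor:
    """Learnable Weight Clipping (LWC): uniform asymmetric fake-quantization
    whose clipping range is controlled by learnable per-channel logits."""
    # sigmoid(.) in (0, 1) shrinks the per-output-channel max/min used as the range
    w_max = torch.sigmoid(gamma).reshape(-1, 1) * W.amax(dim=1, keepdim=True)
    w_min = torch.sigmoid(beta).reshape(-1, 1) * W.amin(dim=1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    # Round-to-nearest; training pairs this with a straight-through estimator
    # so gradients reach gamma and beta.
    q = torch.clamp(torch.round(W / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized ("fake-quantized") weights

def let_transform(X: torch.Tensor, W: torch.Tensor,
                  s: torch.Tensor, d: torch.Tensor):
    """Learnable Equivalent Transformation (LET): per-channel scale s and shift d
    move quantization difficulty from activations to weights while keeping the
    linear layer's full-precision output unchanged."""
    X_t = (X - d) / s        # activations with a tamer, easier-to-quantize range
    W_t = W * s              # scales folded into the weight columns
    bias_t = d @ W.T         # shift folded into the (new) bias term
    return X_t, W_t, bias_t  # X_t @ W_t.T + bias_t == X @ W.T
```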
Quick Start & Requirements
- Install: Clone the repository and install it in editable mode:
git clone https://github.com/OpenGVLab/OmniQuant.git && cd OmniQuant
pip install -e .
- Prerequisites: Python 3.10, CUDA, and a bug-fixed version of AutoGPTQ (provided in the repo).
- Setup: Requires downloading pre-trained OmniQuant parameters from Huggingface and, if you quantize models yourself, calibration data (see the download sketch after this list).
- Docs: Usage, MLC-LLM Integration
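A minimal sketch of fetching pre-trained OmniQuant parameters from the Hugging Face Hub, using the standard huggingface_hub API; the repo_id below is a placeholder, so substitute the identifier listed in the project's model zoo.

```python
# Illustrative download of a pre-trained OmniQuant checkpoint; repo_id is a
# placeholder, not a real model-zoo identifier.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="<omniquant-model-zoo-repo>",   # placeholder, see the project README
    local_dir="./omniquant_checkpoints",    # where to put the files
)
print(f"OmniQuant checkpoints downloaded to: {ckpt_dir}")
```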
Highlighted Details
- Supports various quantization levels: W4A16, W3A16, W2A16, W6A6, W4A4.
- Offers a pre-trained OmniQuant model zoo covering the LLaMA, OPT, Falcon, and Mixtral families.
- Enables deployment on diverse hardware via MLC-LLM, including mobile phones.
- Achieves near-lossless 4-bit quantization for Mixtral-8x7B, reducing memory from 87GB to 23GB.
- Compresses Falcon-180B from 335GB to 65GB, enabling inference on a single A100 80GB GPU (a rough memory estimate for these figures follows this list).
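As a sanity check on the memory figures above, here is a back-of-the-envelope estimate covering weights only; it interprets the quoted "GB" as GiB, and the ~47B total parameter count for Mixtral-8x7B and the 3-bit setting for Falcon-180B are assumptions for illustration.

```python
def approx_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint: parameter count x bits per weight,
    ignoring activations, KV cache, and quantization metadata (scales/zeros)."""
    return n_params * bits_per_weight / 8 / 2**30

# Mixtral-8x7B: roughly 47e9 total parameters (assumption for illustration)
print(approx_weight_memory_gib(47e9, 16))   # ~87.5  -> matches the ~87GB FP16 figure
print(approx_weight_memory_gib(47e9, 4))    # ~21.9  -> ~23GB once scales/zeros are added
# Falcon-180B
print(approx_weight_memory_gib(180e9, 16))  # ~335.3 -> matches the 335GB figure
print(approx_weight_memory_gib(180e9, 3))   # ~62.9  -> ~65GB with quantization metadata
```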
Maintenance & Community
- Actively developed, with recent updates pointing to follow-up work such as PrefixQuant and EfficientQAT.
- Paper accepted as a Spotlight presentation at ICLR 2024.
- Related projects include SmoothQuant, AWQ, GPTQ, and RPTQ.
Licensing & Compatibility
- License: MIT License.
- Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
- "Real quantization" for weight-only quantization may lead to slower inference speeds due to AutoGPTQ kernel limitations.
- 2-bit quantization (W2A16) shows noticeably degraded accuracy and is included mainly as an exploration of extreme compression for deployment.