OmniQuant by OpenGVLab

OmniQuant: code and pre-trained models for an LLM quantization research paper

created 1 year ago
833 stars

Top 43.6% on sourcepulse

View on GitHub
Project Summary

OmniQuant is a quantization technique for Large Language Models (LLMs) that enables significant model compression with minimal performance degradation. It targets researchers and developers seeking to deploy LLMs in resource-constrained environments, offering a range of weight-only and weight-activation quantization schemes.

How It Works

OmniQuant performs omnidirectional calibration: rather than hand-crafting quantization parameters, it learns them by gradient descent on a small calibration set, one transformer block at a time, which keeps calibration costs close to conventional post-training quantization. Two learnable components do the work: Learnable Weight Clipping (LWC), which learns the clipping range of each weight tensor, and Learnable Equivalent Transformation (LET), which shifts quantization difficulty from activations to weights through learnable per-channel scaling and shifting. This combination achieves state-of-the-art accuracy in both weight-only (e.g., W3A16) and weight-activation (e.g., W4A4) quantization, outperforming prior techniques.
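
As an illustration, here is a minimal per-tensor PyTorch sketch of the LWC idea. It is not the repository's implementation (which works per channel and is trained jointly with LET): trainable logits gamma and beta shrink the clipping range before uniform fake quantization, and a straight-through estimator keeps the rounding step differentiable.

    import torch

    def lwc_fake_quant(w: torch.Tensor, gamma: torch.Tensor,
                       beta: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
        """Fake-quantize a weight tensor with a learnable clipping range."""
        qmax = 2 ** n_bits - 1
        # sigmoid keeps the learned clipping factors in (0, 1), so the
        # quantization range can only shrink relative to the raw min/max.
        w_max = torch.sigmoid(gamma) * w.max()
        w_min = torch.sigmoid(beta) * w.min()
        scale = (w_max - w_min).clamp(min=1e-8) / qmax
        zero_point = (-w_min / scale).round()
        x = w / scale + zero_point
        # Straight-through estimator: round/clamp in the forward pass while
        # letting gradients reach gamma and beta during calibration.
        x_q = x + (x.round().clamp(0, qmax) - x).detach()
        return (x_q - zero_point) * scale

    # gamma and beta would be nn.Parameters, optimized block by block on a
    # small calibration set alongside the LET parameters.
    w = torch.randn(4096, 4096)
    gamma = torch.zeros(1, requires_grad=True)
    beta = torch.zeros(1, requires_grad=True)
    w_deq = lwc_fake_quant(w, gamma, beta)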

Quick Start & Requirements

  • Install: Clone the repository and install via pip: pip install -e .
  • Prerequisites: Python 3.10, CUDA, and a bug-fixed version of AutoGPTQ (provided in the repo).
  • Setup: Requires downloading pre-trained OmniQuant parameters from Hugging Face (see the sketch after this list) and, depending on usage, calibration data.
  • Docs: Usage, MLC-LLM Integration
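
If you want to script the checkpoint download, the huggingface_hub client is one way to do it; the repo id below is a placeholder rather than the project's actual one, so check the README for the real location of the released parameters.

    from huggingface_hub import snapshot_download

    # Placeholder repo id: check the OmniQuant README for the actual
    # Hugging Face location of the released quantized parameters.
    ckpt_dir = snapshot_download(repo_id="OpenGVLab/OmniQuant-checkpoints")
    print("OmniQuant parameters downloaded to", ckpt_dir)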

Highlighted Details

  • Supports multiple quantization settings: W4A16, W3A16, W2A16, W6A6, and W4A4 (WxAy denotes x-bit weights with y-bit activations).
  • Offers a pre-trained OmniQuant model zoo for LLaMA, OPT, Falcon, and Mixtral.
  • Enables deployment on diverse hardware via MLC-LLM, including mobile phones.
  • Achieves near-lossless 4-bit quantization of Mixtral-8x7B, reducing memory from 87 GB to 23 GB.
  • Compresses Falcon-180B from 335 GB to 65 GB for single A100 80GB GPU inference (a back-of-the-envelope check of these figures follows this list).
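
The memory figures above can be sanity-checked with simple arithmetic. The parameter counts below are rough (Mixtral-8x7B has about 47B parameters, Falcon-180B about 180B), the Falcon number is consistent with a 3-bit weight setting, and real checkpoints add overhead for scales and zero-points, so expect only approximate agreement.

    def weight_gb(n_params: float, bits: float) -> float:
        """Approximate weight storage in GB, ignoring scale/zero-point overhead."""
        return n_params * bits / 8 / 1e9

    # Roughly matches the 23 GB and 65 GB figures quoted above.
    print(f"Mixtral-8x7B at 4-bit: {weight_gb(47e9, 4):.1f} GB")   # ~23.5 GB
    print(f"Falcon-180B at 3-bit:  {weight_gb(180e9, 3):.1f} GB")  # ~67.5 GB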

Maintenance & Community

  • Actively maintained; recent updates announce the follow-up projects PrefixQuant and EfficientQAT.
  • Paper accepted as a Spotlight presentation at ICLR 2024.
  • Related projects include SmoothQuant, AWQ, GPTQ, and RPTQ.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • "Real quantization" for weight-only quantization may lead to slower inference speeds due to AutoGPTQ kernel limitations.
  • 2-bit quantization (W2A16) shows worse performance, included as an extreme deployment exploration.
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 34 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

Top 0.1% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
created 1 year ago · updated 2 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

Top 0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
created 2 years ago · updated 1 year ago