OmniQuant by OpenGVLab

OmniQuant: LLM quantization research paper

Created 2 years ago
849 stars

Top 42.1% on SourcePulse

View on GitHub
Project Summary

OmniQuant is a quantization technique for Large Language Models (LLMs) that enables significant model compression with minimal performance degradation. It targets researchers and developers who want to deploy LLMs in resource-constrained environments, offering a range of weight-only and weight-activation quantization schemes.

How It Works

OmniQuant (Omnidirectionally Calibrated Quantization) keeps the original model weights frozen and instead learns a small set of quantization parameters block by block, via gradient descent on a small calibration set. Its two components are Learnable Weight Clipping (LWC), which learns clipping thresholds for the weight range, and Learnable Equivalent Transformation (LET), which shifts quantization difficulty from activations to weights through learnable scaling and shifting. Together they achieve state-of-the-art accuracy in both weight-only (e.g., W3A16) and weight-activation (e.g., W4A4) quantization, outperforming prior post-training techniques.
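
A minimal PyTorch sketch of the LWC idea is below. It is illustrative only: the parameter names, initialization, and per-output-channel granularity are assumptions, not the repository's implementation. It fake-quantizes a weight matrix with learnable clipping factors that receive gradients through a straight-through estimator.

```python
# Illustrative sketch of LWC-style learnable weight clipping with fake
# (simulated) quantization; names and defaults are assumptions, not repo code.
import torch
import torch.nn as nn


def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer but pass gradients straight through."""
    return (x.round() - x).detach() + x


class LWCFakeQuant(nn.Module):
    """Asymmetric per-output-channel fake quantizer with learnable clipping."""

    def __init__(self, out_features: int, n_bits: int = 3):
        super().__init__()
        self.qmax = 2 ** n_bits - 1
        # Clipping strengths; sigmoid(4.0) is ~0.98, i.e. start near "no clipping".
        self.gamma = nn.Parameter(torch.full((out_features, 1), 4.0))
        self.beta = nn.Parameter(torch.full((out_features, 1), 4.0))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # Learned clipping of each output channel's min/max range.
        w_max = torch.sigmoid(self.gamma) * w.amax(dim=1, keepdim=True)
        w_min = torch.sigmoid(self.beta) * w.amin(dim=1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / self.qmax
        zero_point = -w_min / scale  # kept continuous here to simplify gradients
        # Quantize to the integer grid, clamp, then dequantize (fake quantization).
        w_int = torch.clamp(round_ste(w / scale) + zero_point, 0, self.qmax)
        return (w_int - zero_point) * scale


# In OmniQuant the clipping parameters are optimized block by block to match the
# full-precision block's outputs on a calibration set; here we only show that
# gradients reach gamma and beta.
layer = nn.Linear(1024, 1024, bias=False)
quantizer = LWCFakeQuant(out_features=1024, n_bits=3)
w_q = quantizer(layer.weight)
loss = (w_q - layer.weight).pow(2).mean()
loss.backward()
print(quantizer.gamma.grad.abs().mean())
```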

Quick Start & Requirements

  • Install: Clone the repository and install via pip: pip install -e .
  • Prerequisites: Python 3.10, CUDA, and a bug-fixed version of AutoGPTQ (provided in the repo).
  • Setup: Requires downloading pre-trained OmniQuant parameters from Hugging Face and, if running calibration yourself, a small calibration dataset (see the loading sketch after this list).
  • Docs: Usage, MLC-LLM Integration
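
For weight-only checkpoints that have been packed into AutoGPTQ's format, loading typically looks like the sketch below. The checkpoint path is a placeholder and the repo's own scripts and flags may differ, so treat this as an assumption and follow the Usage docs for the authoritative workflow.

```python
# Hedged sketch: loading a weight-only checkpoint packed in AutoGPTQ format.
# The path is a placeholder; consult the repo's Usage docs for exact steps.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/omniquant-packed-checkpoint"  # placeholder, not a real path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")

prompt = tokenizer("OmniQuant compresses LLMs by", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=32)[0]))
```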

Highlighted Details

  • Supports various quantization levels: W4A16, W3A16, W2A16, W6A6, W4A4.
  • Offers a pre-trained OmniQuant model zoo for LLaMA, OPT, Falcon, and Mixtral.
  • Enables deployment on diverse hardware via MLC-LLM, including mobile phones.
  • Achieves near-lossless 4-bit quantization for Mixtral-8x7B, reducing memory from 87GB to 23GB.
  • Compresses Falcon-180B from 335GB to 65GB, enabling inference on a single A100 80GB GPU (see the back-of-the-envelope check after this list).
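
Those figures are roughly what a weights-only calculation predicts against an FP16 baseline. The parameter counts and bit-widths below are assumptions chosen to match the quoted numbers; per-group scales/zero-points and activation memory are ignored.

```python
# Weights-only, FP16-baseline sanity check of the quoted memory figures.
# Parameter counts are approximate; quantization metadata is ignored, so real
# checkpoints come out slightly larger than the low-bit numbers here.
GiB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / GiB


for name, n_params, bits in [
    ("Mixtral-8x7B FP16", 46.7e9, 16),  # ~87 GiB
    ("Mixtral-8x7B W4  ", 46.7e9, 4),   # ~22 GiB, plus overhead -> ~23 GB quoted
    ("Falcon-180B  FP16", 180e9, 16),   # ~335 GiB
    ("Falcon-180B  W3  ", 180e9, 3),    # ~63 GiB, plus overhead -> ~65 GB quoted
]:
    print(f"{name}: ~{weight_memory_gib(n_params, bits):.0f} GiB")
```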

Maintenance & Community

  • Active development; recent news entries announce follow-up projects such as PrefixQuant and EfficientQAT.
  • Paper accepted as a Spotlight presentation at ICLR 2024.
  • Related projects include SmoothQuant, AWQ, GPTQ, and RPTQ.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • "Real quantization" for weight-only quantization may lead to slower inference speeds due to AutoGPTQ kernel limitations.
  • 2-bit quantization (W2A16) shows noticeably degraded performance and is included mainly as an exploration of extremely low-bit deployment.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 12 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Zack Li (Cofounder of Nexa AI), and 4 more.

smoothquant by mit-han-lab

0.3%
2k
Post-training quantization research paper for large language models
Created 2 years ago
Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1%
2k
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago
Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3%
3k
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago
Updated 1 year ago