QuIP by Cornell-RelaxML

Code for LLM quantization research

created 2 years ago
376 stars

Top 76.7% on sourcepulse

Project Summary

QuIP provides code for 2-bit quantization of large language models (LLMs) using an "incoherence processing" technique, enabling significant model compression with minimal performance degradation. It's targeted at researchers and engineers working with LLMs who need to reduce memory footprint and inference costs. The primary benefit is achieving near FP16 performance at 2-bit precision.

How It Works

QuIP builds upon the OPTQ repository, adding "incoherence processing": pre- and post-processing steps (enabled via the --incoh_processing meta-argument) that conjugate the weights and the Hessian proxy with random orthogonal transforms so that no single large entry dominates the rounding error. As detailed in the paper, this is what makes quantization down to 2 bits stable. The repository also implements several quantization algorithms, including LDLQ, LDLQ_RG, and GPTQ, with theoretical analysis and empirical verification of their equivalence.
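A toy sketch of the incoherence idea only (this is not the repository's LDLQ code, which also uses Hessian information from calibration data; all names below are illustrative): rotating the weights by random orthogonal matrices spreads outlier energy across entries, so a plain 2-bit round-to-nearest grid loses much less.

```python
# Toy illustration of incoherence processing: rotate, quantize, rotate back.
import torch

def random_orthogonal(n, seed):
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_2bit(w):
    # Round-to-nearest onto a 4-level (2-bit) uniform grid, one scale per tensor.
    scale = w.abs().max() / 1.5                     # grid at {-1.5,-0.5,0.5,1.5}*scale
    levels = torch.tensor([-1.5, -0.5, 0.5, 1.5]) * scale
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

W = torch.randn(128, 256)
W[:4, :4] *= 25                                     # inject a few outliers

U, V = random_orthogonal(128, 0), random_orthogonal(256, 1)
W_rot = U @ W @ V.T                                 # pre-processing
W_hat = U.T @ quantize_2bit(W_rot) @ V              # quantize, then post-process

print("error with incoherence processing:", (W - W_hat).norm().item())
print("error without:                    ", (W - quantize_2bit(W)).norm().item())
```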

Quick Start & Requirements

  • Install/Run: Use the Python scripts provided in the repository (e.g., opt.py, main.py).
  • Prerequisites: Python, a CUDA-capable GPU, and the target model weights (e.g., facebook/opt-125m; see the fetch sketch after this list).
  • Resources: A GPU is required for quantization and evaluation; larger models may benefit from --lazy_batch for memory efficiency.
  • Docs: Refer to the paper for full details.
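A minimal prerequisite check, assuming the Hugging Face transformers package is installed (the repository's OPTQ-derived scripts typically load weights this way); the model id matches the README's example:

```python
# Confirm the example model weights can be fetched before running opt.py / main.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.config.num_hidden_layers, "layers,",
      sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```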

Highlighted Details

  • Achieves near FP16 performance at 2-bit quantization for models like Llama 1 and 2.
  • Introduces "QuIP#," an improved method with lattice codebooks and efficient CUDA implementation.
  • Provides implementations and comparisons for LDLQ, LDLQ_RG, GPTQ, allbal, and ldlbal_admm quantization methods.
  • Includes scripts for benchmarking, verifying algorithm equivalence, and computing proxy losses (see the proxy-loss sketch below).
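For reference, the "proxy loss" in OPTQ-style methods is the layer-wise objective tr((W − Ŵ) H (W − Ŵ)ᵀ) with H = X Xᵀ built from calibration activations X. A sketch with illustrative names (not the repository's own function):

```python
import torch

def proxy_loss(W, W_hat, X):
    # X: (d_in, n_samples) calibration activations for one linear layer.
    H = X @ X.T                      # Hessian proxy, shape (d_in, d_in)
    D = W - W_hat                    # weight error, shape (d_out, d_in)
    return torch.trace(D @ H @ D.T)

d_out, d_in, n = 64, 128, 512
W = torch.randn(d_out, d_in)
W_hat = W + 0.01 * torch.randn_like(W)   # stand-in for a quantized weight
X = torch.randn(d_in, n)
print(proxy_loss(W, W_hat, X).item())
```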

Maintenance & Community

The project is associated with Cornell-RelaxML. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Quantization can be slow on larger models because the algorithms have low compute-to-memory-access ratios. The README notes that Llama 2 is evaluated with a fixed context length of 2048, which may need adjustment for models with longer native context windows.
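As a rough sketch of where that fixed context length enters (names and structure are ours, not the repository's evaluation code), perplexity evaluation typically chunks the token stream into segments of seqlen tokens, so the value to adjust is the segment length:

```python
import torch

@torch.no_grad()
def eval_ppl(model, token_ids, seqlen=2048, device="cuda"):
    # token_ids: 1-D tensor of the tokenized evaluation corpus.
    n_segments = token_ids.numel() // seqlen
    nlls = []
    for i in range(n_segments):
        batch = token_ids[i * seqlen:(i + 1) * seqlen].unsqueeze(0).to(device)
        out = model(batch, labels=batch)          # HF causal LMs return .loss
        nlls.append(out.loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen))
```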

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History: 15 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994 (top 0.1% on sourcepulse, 1k stars)
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago, updated 2 months ago.
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200 (top 0.0% on sourcepulse, 3k stars)
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago, updated 1 year ago.
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ (top 0.1% on sourcepulse, 5k stars)
LLM quantization package using GPTQ algorithm
Created 2 years ago, updated 3 months ago.