QuIP by Cornell-RelaxML

Code for LLM quantization research

created 2 years ago
376 stars

Top 76.7% on sourcepulse

Project Summary

QuIP provides code for 2-bit quantization of large language models (LLMs) using an "incoherence processing" technique, enabling significant model compression with minimal performance degradation. It's targeted at researchers and engineers working with LLMs who need to reduce memory footprint and inference costs. The primary benefit is achieving near FP16 performance at 2-bit precision.

How It Works

QuIP builds upon the OPTQ repository, adding "incoherence processing": pre- and post-processing steps (enabled via the --incoh_processing meta-argument) that conjugate the weights and the Hessian proxy with random orthogonal transforms so that no single large entry dominates the rounding error. As detailed in the paper, this is what makes quantization down to 2 bits stable. The repository also implements several quantization algorithms, including LDLQ, LDLQ_RG, and GPTQ, with theoretical analysis and empirical verification of their equivalence.
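A toy sketch of the incoherence idea only (this is not the repository's LDLQ code, which also uses Hessian information from calibration data; all names below are illustrative): rotating the weights by random orthogonal matrices spreads outlier energy across entries, so a plain 2-bit round-to-nearest grid loses much less.

```python
# Toy illustration of incoherence processing: rotate, quantize, rotate back.
import torch

def random_orthogonal(n, seed):
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_2bit(w):
    # Round-to-nearest onto a 4-level (2-bit) uniform grid, one scale per tensor.
    scale = w.abs().max() / 1.5                     # grid at {-1.5,-0.5,0.5,1.5}*scale
    levels = torch.tensor([-1.5, -0.5, 0.5, 1.5]) * scale
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

W = torch.randn(128, 256)
W[:4, :4] *= 25                                     # inject a few outliers

U, V = random_orthogonal(128, 0), random_orthogonal(256, 1)
W_rot = U @ W @ V.T                                 # pre-processing
W_hat = U.T @ quantize_2bit(W_rot) @ V              # quantize, then post-process

print("error with incoherence processing:", (W - W_hat).norm().item())
print("error without:                    ", (W - quantize_2bit(W)).norm().item())
```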

Quick Start & Requirements

  • Install/Run: Use the Python scripts provided in the repository (e.g., opt.py, main.py).
  • Prerequisites: Python, a CUDA-capable GPU, and the target model weights (e.g., facebook/opt-125m; see the fetch sketch after this list).
  • Resources: A GPU is required for quantization and evaluation; larger models may benefit from --lazy_batch for memory efficiency.
  • Docs: Refer to the paper for full details.
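A minimal prerequisite check, assuming the Hugging Face transformers package is installed (the repository's OPTQ-derived scripts typically load weights this way); the model id matches the README's example:

```python
# Confirm the example model weights can be fetched before running opt.py / main.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(model.config.num_hidden_layers, "layers,",
      sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```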

Highlighted Details

  • Achieves near FP16 performance at 2-bit quantization for models like Llama 1 and 2.
  • Introduces "QuIP#," an improved method with lattice codebooks and efficient CUDA implementation.
  • Provides implementations and comparisons for LDLQ, LDLQ_RG, GPTQ, allbal, and ldlbal_admm quantization methods.
  • Includes scripts for benchmarking, verifying algorithm equivalence, and computing proxy losses (see the proxy-loss sketch below).
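For reference, the "proxy loss" in OPTQ-style methods is the layer-wise objective tr((W − Ŵ) H (W − Ŵ)ᵀ) with H = X Xᵀ built from calibration activations X. A sketch with illustrative names (not the repository's own function):

```python
import torch

def proxy_loss(W, W_hat, X):
    # X: (d_in, n_samples) calibration activations for one linear layer.
    H = X @ X.T                      # Hessian proxy, shape (d_in, d_in)
    D = W - W_hat                    # weight error, shape (d_out, d_in)
    return torch.trace(D @ H @ D.T)

d_out, d_in, n = 64, 128, 512
W = torch.randn(d_out, d_in)
W_hat = W + 0.01 * torch.randn_like(W)   # stand-in for a quantized weight
X = torch.randn(d_in, n)
print(proxy_loss(W, W_hat, X).item())
```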

Maintenance & Community

The project is associated with Cornell-RelaxML. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Quantization can be slow on larger models because the algorithms have low compute-to-memory-access ratios. The README notes that Llama 2 is evaluated with a fixed context length of 2048, which may need adjustment for models with longer native context windows.
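As a rough sketch of where that fixed context length enters (names and structure are ours, not the repository's evaluation code), perplexity evaluation typically chunks the token stream into segments of seqlen tokens, so the value to adjust is the segment length:

```python
import torch

@torch.no_grad()
def eval_ppl(model, token_ids, seqlen=2048, device="cuda"):
    # token_ids: 1-D tensor of the tokenized evaluation corpus.
    n_segments = token_ids.numel() // seqlen
    nlls = []
    for i in range(n_segments):
        batch = token_ids[i * seqlen:(i + 1) * seqlen].unsqueeze(0).to(device)
        out = model(batch, labels=batch)          # HF causal LMs return .loss
        nlls.append(out.loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen))
```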

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History: 15 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994 (top 0.1% on sourcepulse, 1k stars)
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago, updated 2 months ago.
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200 (top 0.0% on sourcepulse, 3k stars)
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago, updated 1 year ago.
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ (top 0.1% on sourcepulse, 5k stars)
LLM quantization package using GPTQ algorithm
Created 2 years ago, updated 3 months ago.