Code for LLM quantization research
Top 76.7% on sourcepulse
QuIP provides code for 2-bit quantization of large language models (LLMs) using an "incoherence processing" technique, enabling significant model compression with minimal performance degradation. It's targeted at researchers and engineers working with LLMs who need to reduce memory footprint and inference costs. The primary benefit is achieving near FP16 performance at 2-bit precision.
How It Works
QuIP builds on the OPTQ codebase and adds "incoherence processing": dedicated pre- and post-processing steps enabled through the --incoh_processing meta-argument. This approach, detailed in the accompanying paper, controls quantization error well enough to make 2-bit quantization stable. The repository also includes implementations of several quantization algorithms (LDLQ, LDLQ_RG, GPTQ), with a focus on theoretical analysis and empirical verification of their equivalence.
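As a rough illustration of the incoherence-processing idea, the sketch below rotates a weight matrix with random orthogonal matrices before rounding and rotates back afterwards. The function names are hypothetical, the rounding is plain nearest-neighbor rather than LDLQ, and the real pipeline also transforms the Hessian, so treat this as a conceptual sketch rather than the repository's implementation.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_nearest(w, bits=2):
    # Symmetric uniform quantizer: scale so the largest entry maps to the
    # largest positive code, round to integers, then rescale.
    max_code = 2 ** (bits - 1) - 1  # 1 for 2 bits
    scale = np.abs(w).max() / max(max_code, 1)
    codes = np.clip(np.round(w / scale), -max_code - 1, max_code)
    return codes * scale

def incoherent_quantize(w, bits=2, seed=0):
    # Rotate, round, rotate back (hypothetical helper, not the repo's API).
    rng = np.random.default_rng(seed)
    u = random_orthogonal(w.shape[0], rng)
    v = random_orthogonal(w.shape[1], rng)
    w_rot = u.T @ w @ v           # pre-processing: spread out large entries
    w_hat = quantize_nearest(w_rot, bits)
    return u @ w_hat @ v.T        # post-processing: undo the rotation

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.standard_normal((128, 128))
    err = np.linalg.norm(w - incoherent_quantize(w)) / np.linalg.norm(w)
    print(f"relative quantization error: {err:.3f}")
```

The rotations make the weight entries small and evenly spread ("incoherent"), which is what lets aggressive 2-bit rounding behave well; the paper formalizes this and pairs it with LDLQ-style rounding.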
Quick Start & Requirements
The main quantization scripts are opt.py and main.py, and example runs target small models such as facebook/opt-125m. The --lazy_batch flag can be passed for memory efficiency; a hedged example invocation is sketched below.
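Only --incoh_processing and --lazy_batch are named in this summary; the positional arguments and the --wbits flag below are assumptions based on the OPTQ-style interface the code builds on, so the exact command may differ.

```bash
# Assumed invocation: quantize OPT-125M to 2 bits with incoherence processing.
# Flags other than --incoh_processing and --lazy_batch are illustrative only.
python opt.py facebook/opt-125m c4 --wbits 2 --incoh_processing --lazy_batch
```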
Highlighted Details
Maintenance & Community
The project is associated with Cornell-RelaxML. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Quantization algorithms can be slow on larger models due to low compute-to-memory-access ratios. The README mentions evaluation with a fixed context length (2048) for Llama-2, which may need adjustment.