Code for a NeurIPS 2024 research paper on LLM quantization
QuaRot introduces an end-to-end 4-bit quantization scheme for Large Language Models (LLMs), targeting researchers and engineers who want to reduce model size and inference cost. By rotating the model's hidden states and activations to remove outliers without altering its outputs, QuaRot enables every matrix multiplication to run at 4-bit precision, eliminating the need to keep any channels in higher precision.
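To see why outliers are the obstacle at this bit-width, consider a minimal symmetric int4 round trip (a generic NumPy sketch, not QuaRot's kernels; `quantize_int4` and `dequantize` are hypothetical helpers): a single outlier inflates the quantization scale, crushing the precision available to every other value.

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7]."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

q, s = quantize_int4(w)
print("error without outlier:", np.abs(w - dequantize(q, s)).max())

# One outlier forces a much larger scale, so the remaining values
# collapse onto a handful of quantization levels.
w[0, 0] = 100.0
q, s = quantize_int4(w)
print("error with outlier:   ", np.abs(w - dequantize(q, s)).max())
```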
How It Works
QuaRot employs a rotation-based quantization approach: it applies orthogonal (Hadamard) transforms to the LLM's hidden states and activations, redistributing outlier values so that whole tensors fit within the narrow range representable by 4-bit integers. Because the rotations are orthogonal, model outputs are mathematically unchanged (computational invariance), which permits aggressive quantization of all model components, including weights, activations, and the KV cache, and yields significant memory and computational savings.
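The key identity is easy to demonstrate (a sketch assuming NumPy and SciPy; the normalized Hadamard matrix here stands in for QuaRot's randomized rotations): rotating the activations and counter-rotating the weights leaves the matrix product numerically unchanged while spreading the outlier's energy across channels.

```python
import numpy as np
from scipy.linalg import hadamard

d = 64                            # hidden size (power of two for Hadamard)
Q = hadamard(d) / np.sqrt(d)      # orthogonal rotation: Q @ Q.T == I

rng = np.random.default_rng(0)
x = rng.normal(size=(8, d))
x[:, 3] *= 50.0                   # inject an outlier channel
W = rng.normal(size=(d, d))

# Computational invariance: (x Q)(Q^T W) == x W up to float rounding.
y_ref = x @ W
y_rot = (x @ Q) @ (Q.T @ W)
assert np.allclose(y_ref, y_rot)

# The rotation flattens the outlier, shrinking the dynamic range that
# 4-bit quantization of the activations must cover.
print("max |x| before rotation:", np.abs(x).max())
print("max |x| after rotation: ", np.abs(x @ Q).max())
```

In QuaRot, most of these rotations are absorbed into adjacent weight matrices offline, with a few lightweight Hadamard transforms applied online during inference.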
Quick Start & Requirements
```bash
pip install -e .   # editable install
# or
pip install .
```
Highlighted Details
- End-to-end 4-bit quantization covering weights, activations, and the KV cache
- Outlier removal via output-preserving (computationally invariant) rotations
- Accompanies the NeurIPS 2024 paper "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs"
Maintenance & Community
The project is associated with the NeurIPS 2024 paper "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." Further community or maintenance details are not specified in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README.
Limitations & Caveats
The README does not detail specific limitations, unsupported platforms, or known bugs. The project appears to be research-oriented, and its readiness for production deployment may require further evaluation.