QuaRot by spcl

Code for a NeurIPS 2024 research paper on LLM quantization

created 1 year ago
410 stars

Top 72.3% on sourcepulse

Project Summary

QuaRot introduces an end-to-end 4-bit quantization scheme for Large Language Models (LLMs), targeting researchers and engineers seeking to reduce model size and inference costs. By rotating LLMs to remove outliers in hidden states and activations without altering outputs, QuaRot enables all matrix multiplications to operate at 4-bit precision, eliminating the need for higher-precision channels.

How It Works

QuaRot employs a novel rotation-based quantization approach. It applies orthogonal (Hadamard) rotations to the LLM's hidden states and activations, spreading outlier values across channels so that every value fits within the representable range of 4-bit integers. Because the rotations cancel out mathematically, the model's outputs are unchanged; this computational invariance allows aggressive quantization across all model components, including weights, activations, and the KV cache, yielding significant memory and computational savings.
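The invariance idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the project's actual kernels: folding an orthogonal rotation Q into the weights and rotating the incoming activations leaves the layer output unchanged, while the outlier's magnitude is spread across channels.

```python
import numpy as np

# Minimal sketch of the computational-invariance idea behind rotation-based
# quantization (illustrative only; not QuaRot's kernel code).
rng = np.random.default_rng(0)
d = 8

# A toy activation vector with one large outlier channel.
x = rng.normal(size=d)
x[3] = 50.0

W = rng.normal(size=(d, d))  # toy weight matrix

def hadamard(n):
    """Normalized Hadamard matrix (orthogonal: Q @ Q.T == I); n a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

Q = hadamard(d)

# Fold the rotation into the weights offline; rotate activations online.
W_rot = W @ Q.T
x_rot = Q @ x

# Outputs are identical up to floating-point error (computational invariance).
assert np.allclose(W @ x, W_rot @ x_rot)

# The rotation spreads the outlier, shrinking the dynamic range to quantize.
print(np.abs(x).max(), np.abs(x_rot).max())  # rotated max is far below 50
```

Since Q is orthogonal, W_rot @ x_rot equals (W Q^T)(Q x) = W x exactly; only the representation changes, which is what lets every matmul run at 4 bits.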

Quick Start & Requirements

  • Install by cloning the repository and running either pip install -e . (editable) or pip install . from the repository root.
  • Requires a C++ compiler for kernel compilation.
  • Official documentation and citation details are available in the repository.
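The steps above can be sketched as a shell session. The repository URL is assumed from the stated GitHub org (spcl); check the README for the exact instructions.

```shell
# Hypothetical install walkthrough based on the bullets above;
# the clone URL is an assumption from the "spcl" org name.
git clone https://github.com/spcl/QuaRot.git
cd QuaRot

# Editable install; this step compiles the C++/CUDA kernels, so a
# C++ compiler (and a CUDA toolchain for GPU kernels) must be on PATH.
pip install -e .
```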

Highlighted Details

  • Achieves 4-bit end-to-end quantization for LLMs, including weights, activations, and KV cache.
  • Demonstrates minimal performance degradation: LLaMa2-70B quantized model shows losses of at most 0.29 WikiText perplexity and retains 99% of zero-shot performance.
  • Addresses outlier issues in quantization by rotating hidden states and activations.
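Why outliers matter for 4-bit quantization can be shown with a generic symmetric INT4 quantizer (a sketch, not QuaRot's implementation): one large value stretches the quantization scale, so every other value collapses onto very few integer levels.

```python
import numpy as np

# Generic symmetric 4-bit quantization sketch (not QuaRot's kernel code),
# demonstrating how a single outlier channel inflates quantization error.

def quantize_int4(x):
    scale = np.abs(x).max() / 7.0        # symmetric INT4 levels: -7..7
    q = np.clip(np.round(x / scale), -7, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

x = np.array([0.1, -0.2, 0.15, 0.05, -0.1, 0.2])
x_outlier = np.append(x, 50.0)           # same values plus one outlier

q1, s1 = quantize_int4(x)
q2, s2 = quantize_int4(x_outlier)

err_no_outlier = np.abs(dequantize(q1, s1) - x).max()
err_outlier = np.abs(dequantize(q2, s2)[:-1] - x).max()
print(err_no_outlier, err_outlier)       # error on the small values grows sharply
```

With the outlier present, the scale is set by 50.0, so the small values all round to zero; rotating outliers away before quantizing avoids exactly this failure mode.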

Maintenance & Community

The project is associated with the NeurIPS 2024 paper "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." Further community or maintenance details are not specified in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The project appears to be research-oriented, and its readiness for production deployment may require further evaluation.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 35 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

GPTQ-for-LLaMa by qwopqwop200

4-bit quantization for LLaMA models using GPTQ

created 2 years ago, updated 1 year ago
3k stars