QuaRot by spcl

Code for a NeurIPS 2024 research paper on LLM quantization

Created 1 year ago
424 stars

Top 69.5% on SourcePulse

Project Summary

QuaRot introduces an end-to-end 4-bit quantization scheme for Large Language Models (LLMs), targeting researchers and engineers seeking to reduce model size and inference costs. By rotating LLMs to remove outliers in hidden states and activations without altering outputs, QuaRot enables all matrix multiplications to operate at 4-bit precision, eliminating the need for higher-precision channels.

How It Works

QuaRot employs a rotation-based quantization approach. It applies orthogonal rotations to the LLM's hidden states and activations, spreading outlier magnitudes across channels so that all values fit within the range representable by 4-bit integers. Because the rotations are orthogonal, the model's outputs are unchanged; this computational invariance permits aggressive quantization of all model components, including weights, activations, and the KV cache, yielding significant memory and computational savings.

Quick Start & Requirements

  • Install by cloning the repository and running pip install -e . (editable) or pip install .
  • Requires a C++ compiler for kernel compilation.
  • Official documentation and citation details are available in the repository.

Highlighted Details

  • Achieves 4-bit end-to-end quantization for LLMs, including weights, activations, and KV cache.
  • Demonstrates minimal performance degradation: the quantized LLaMa2-70B model loses at most 0.29 WikiText perplexity and retains 99% of its zero-shot task performance.
  • Addresses outlier issues in quantization by rotating hidden states and activations.

Maintenance & Community

The project is associated with the NeurIPS 2024 paper "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." Further community or maintenance details are not specified in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The project appears to be research-oriented, and its readiness for production deployment may require further evaluation.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

  • Top 0.3% on SourcePulse
  • 3k stars
  • Weight quantization research paper for LLM compression/acceleration
  • Created 2 years ago, updated 2 months ago