QuIP by Cornell-RelaxML

Code for LLM quantization research

Created 2 years ago
380 stars

Top 75.0% on SourcePulse

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).
Project Summary

QuIP provides code for 2-bit quantization of large language models (LLMs) using an "incoherence processing" technique, enabling significant model compression with minimal performance degradation. It's targeted at researchers and engineers working with LLMs who need to reduce memory footprint and inference costs. The primary benefit is achieving near FP16 performance at 2-bit precision.

How It Works

QuIP builds on the OPTQ codebase and introduces "incoherence processing": pre- and post-processing steps (enabled via the --incoh_processing meta-argument) that, as detailed in the accompanying paper, multiply the weight and Hessian matrices by random orthogonal transforms so that rounding error stays controlled, which is what makes quantization stable at 2 bits. The repository also implements several quantization algorithms, including LDLQ, LDLQ_RG, and GPTQ, with a focus on theoretical analysis and empirical verification of their equivalence.
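The idea lends itself to a short illustration. Below is a minimal sketch of incoherence processing, not the repository's implementation: random orthogonal matrices rotate the weights (and, consistently, the Hessian proxy) before rounding, and the rotations are undone afterward. Plain nearest rounding stands in for QuIP's LDLQ rounding, and all function names are illustrative.

```python
# Minimal sketch of incoherence processing (illustrative, not QuIP's code):
# rotate W and H with random orthogonal matrices, quantize in the rotated
# basis, then rotate back. Nearest rounding stands in for LDLQ/GPTQ.
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))          # sign fix for a uniform distribution

def nearest_round(w, bits=2):
    """Uniform per-row rounding to 2**bits levels (illustrative stand-in)."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-12)
    return np.round((w - lo) / scale) * scale + lo

def incoherence_quantize(W, H, bits=2, seed=0):
    """Pre-process, quantize, post-process (the --incoh_processing idea)."""
    rng = np.random.default_rng(seed)
    U = random_orthogonal(W.shape[0], rng)  # rotation on output features
    V = random_orthogonal(W.shape[1], rng)  # rotation on input features
    W_inc = U @ W @ V                       # "incoherent" weights: entries spread out
    H_inc = V.T @ H @ V                     # Hessian proxy in the rotated input basis
    # LDLQ/GPTQ would use H_inc to round W_inc; this nearest-rounding stand-in ignores it.
    W_hat_inc = nearest_round(W_inc, bits)
    return U.T @ W_hat_inc @ V.T            # undo the rotations (post-processing)

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))            # weights: (out_features, in_features)
X = rng.standard_normal((16, 64))           # calibration inputs
H = X @ X.T / X.shape[1]                    # second moment of the inputs
W_hat = incoherence_quantize(W, H, bits=2)
print("max |W_hat - W|:", np.abs(W_hat - W).max())
```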

Quick Start & Requirements

  • Install/Run: Use the Python scripts provided in the repository (e.g., opt.py, main.py); see the example invocation after this list.
  • Prerequisites: Python, CUDA (implied for performance), specific LLM model weights (e.g., facebook/opt-125m).
  • Resources: Requires GPU for quantization and evaluation. Larger models may benefit from --lazy_batch for memory efficiency.
  • Docs: Refer to the paper for full details.
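
A hedged sketch of how a quantization run might be launched is shown below. Only opt.py, facebook/opt-125m, --incoh_processing, and --lazy_batch appear in this summary; the calibration-set argument, bit-width flag, and overall argument order are assumptions modeled on an OPTQ-style command line and should be checked against the repository's README.

```python
# Hypothetical launcher for the repository's opt.py quantization script.
# Only opt.py, facebook/opt-125m, --incoh_processing, and --lazy_batch are
# named in this summary; the other arguments are assumptions and must be
# verified against the actual script's help output.
import subprocess

cmd = [
    "python", "opt.py",
    "facebook/opt-125m",     # model to quantize (named in this summary)
    "c4",                    # calibration dataset (assumed positional argument)
    "--wbits", "2",          # target bit width (assumed flag name)
    "--incoh_processing",    # incoherence pre/post-processing meta-argument
    "--lazy_batch",          # optional memory-saving mode for larger models
]
subprocess.run(cmd, check=True)
```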

Highlighted Details

  • Achieves near FP16 performance at 2-bit quantization for models like Llama 1 and 2.
  • Introduces "QuIP#," an improved method with lattice codebooks and an efficient CUDA implementation.
  • Provides implementations and comparisons for LDLQ, LDLQ_RG, GPTQ, allbal, and ldlbal_admm quantization methods.
  • Includes scripts for benchmarking, verifying algorithm equivalence, and computing proxy losses (see the proxy-loss sketch after this list).
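
For reference, the proxy loss in this line of work (OPTQ/GPTQ and the QuIP paper) is the layer-wise quadratic objective tr((Ŵ − W) H (Ŵ − W)ᵀ), where H is the second moment of the layer's calibration inputs. The sketch below computes it directly; shapes, names, and the crude rounding baseline are illustrative rather than taken from the repository.

```python
# Minimal sketch of the layer-wise proxy loss tr((W_hat - W) H (W_hat - W)^T),
# with H estimated from calibration inputs. Illustrative only.
import numpy as np

def proxy_loss(W, W_hat, X):
    """Quadratic proxy for the layer output error caused by quantization."""
    H = X @ X.T / X.shape[1]          # Hessian proxy: second moment of inputs
    E = W_hat - W                     # quantization error
    return float(np.trace(E @ H @ E.T))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))      # weights: (out_features, in_features)
X = rng.standard_normal((16, 256))    # calibration inputs: (in_features, samples)
W_hat = np.round(W * 2) / 2           # crude half-step rounding as a stand-in
print(proxy_loss(W, W_hat, X))
```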

Maintenance & Community

The project is associated with Cornell-RelaxML. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Quantization algorithms can be slow on larger models due to low compute-to-memory-access ratios. The README notes that Llama-2 evaluation uses a fixed context length of 2048, which may need adjustment.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

QuaRot by spcl

0.5%
424 stars
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago
Updated 9 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1%
2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago
Updated 1 year ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3%
3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1%
6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago
Updated 1 year ago