quip-sharp by Cornell-RelaxML

LLM quantization for extreme compression

Created 1 year ago
555 stars

Top 57.7% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

QuIP# is a weight-only post-training quantization method designed for extreme compression of Large Language Models (LLMs) down to 4 bits per weight or less. It targets researchers and practitioners seeking to deploy LLMs with significantly reduced memory footprints and improved inference speeds, offering state-of-the-art performance in highly compressed regimes.

How It Works

QuIP# employs a novel approach combining a randomized Hadamard transform (RHT) for efficient incoherence processing, $E_8$ lattice-based codebooks for fast vector quantization, and a fine-tuning scheme to capture inter-layer dependencies. This combination yields superior quantization quality at very low bitrates; notably, QuIP#'s 3-bit models scale better than theoretically lossless 4-bit quantization.
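The toy sketch below illustrates the two core ideas in plain PyTorch: sign-randomized Hadamard incoherence processing and nearest-point rounding to the $E_8$ lattice. It is not the repo's implementation; QuIP# actually uses a scaled, finite $E_8$-based codebook (E8P) with fused CUDA kernels, and all function names here are illustrative.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

def nearest_D8(x: torch.Tensor) -> torch.Tensor:
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    y = torch.round(x)
    if int(y.sum()) % 2 != 0:
        # round the coordinate with the largest rounding error the other way
        i = int(torch.argmax(torch.abs(x - y)))
        y[i] += 1.0 if x[i] >= y[i] else -1.0
    return y

def nearest_E8(x: torch.Tensor) -> torch.Tensor:
    """E8 = D8 union (D8 + 1/2); return whichever coset gives the closer point."""
    a = nearest_D8(x)
    b = nearest_D8(x - 0.5) + 0.5
    return a if torch.norm(x - a) <= torch.norm(x - b) else b

# Toy incoherence processing on an 8x8 weight block:
# W -> U W V^T with U, V sign-flipped orthonormal Hadamard matrices,
# round each 8-dim row to the E8 lattice, then invert the transform.
# (Real QuIP# also scales weights and restricts to the finite E8P codebook.)
n = 8
W = torch.randn(n, n)
H = hadamard(n) / n ** 0.5
U = (torch.randint(0, 2, (n, 1)).float() * 2 - 1) * H   # diag(random signs) @ H
V = (torch.randint(0, 2, (n, 1)).float() * 2 - 1) * H
W_inc = U @ W @ V.T                                      # incoherence-processed weights
W_q = torch.stack([nearest_E8(row) for row in W_inc])    # E8-rounded (unscaled) weights
W_hat = U.T @ W_q @ V                                    # dequantized approximation
print((W - W_hat).abs().mean())
```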

Quick Start & Requirements

  • Install via pip install -r requirements.txt and build CUDA inference kernels (cd quiptools && python setup.py install).
  • Requires CUDA-enabled GPU.
  • Pre-quantized models and Hessians are available on Hugging Face (a hedged loading sketch follows this list).
  • Official documentation and examples for Llama models are provided.
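
As a rough usage sketch (not the repo's documented workflow), loading one of the pre-quantized checkpoints through the standard transformers API might look like the following; the model id is a placeholder, and the actual quip-sharp checkpoints may require the repository's own loading and evaluation scripts instead.

```python
# Hypothetical loading sketch; the model id below is a placeholder, not a real repo name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "relaxml/some-prequantized-quip-sharp-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # may be needed if the checkpoint ships custom modeling code
    device_map="auto",        # a CUDA-enabled GPU is required for the quiptools kernels
)

prompt = "Extreme LLM compression makes it possible to"
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```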

Highlighted Details

  • Achieves state-of-the-art performance at $\le 4$ bits per weight.
  • Demonstrates that its 3-bit models scale better than theoretically lossless 4-bit models.
  • Provides CUDA kernels for fast inference, though 3-bit inference is not yet fully optimized.
  • Supports fine-tuning during quantization to capture inter-layer interactions.

Maintenance & Community

This codebase is no longer under active development; QTIP is its successor method. Users are encouraged to open GitHub issues for questions. Pre-quantized models remain available on Hugging Face.

Licensing & Compatibility

The code is licensed under GNU GPL v3, so derivative works are subject to its copyleft terms. Use of the underlying LLM models (Llama, Mistral) is governed by their respective licenses.

Limitations & Caveats

The project is not under active development. Optimized CUDA kernels for 1-bit matrix-vector multiplication are missing; because the 3-bit format adds a 1-bit residual on top of the 2-bit codebook, this limits 3-bit inference speed. While adaptable to non-Llama architectures, the pipeline requires manual script modification.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History
3 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0%
307
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab

0.1%
2k
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 2 years ago
Updated 1 year ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4%
1k
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago
Updated 1 month ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

0.3%
3k
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200

0.0%
3k
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago
Updated 1 year ago