SpQR by Vahe1994

Weight compression research paper for near-lossless LLM quantization

created 2 years ago
546 stars

Top 59.3% on sourcepulse

1 Expert Loves This Project
Project Summary

SpQR offers near-lossless compression for Large Language Models (LLMs) by employing a novel sparse-quantized representation. This repository is for researchers and engineers seeking to reduce LLM memory footprint and potentially accelerate inference, targeting models like LLaMA, Falcon, and OPT.

How It Works

SpQR achieves compression in two stages: it first identifies outlier weights that would cause particularly large quantization errors and stores them in a higher-precision sparse format, then quantizes the remaining weights to a low bit-width in small groups. Treating the outliers separately is what preserves model accuracy and yields near-lossless compression. The method is detailed in the paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression."
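
For intuition, here is a minimal PyTorch sketch of the general idea. It is not code from this repository: the group size, bit width, and outlier threshold are arbitrary example values, and the real SpQR format is considerably more involved (for example, it also compresses the per-group quantization statistics).

    # Illustrative sketch only (not the repository's implementation): split a weight
    # matrix into a sparse full-precision outlier component plus a per-group
    # quantized dense component, in the spirit of SpQR.
    import torch

    def toy_spqr_split(weight, group_size=16, bits=3, outlier_z=3.0):
        """Return (quantized groups, scales, zero-points, sparse outliers)."""
        rows, cols = weight.shape
        assert cols % group_size == 0
        # 1) Mark outliers: weights far from their row's typical magnitude.
        z = (weight - weight.mean(dim=1, keepdim=True)).abs() / (weight.std(dim=1, keepdim=True) + 1e-8)
        outlier_mask = z > outlier_z
        outliers = (weight * outlier_mask).to_sparse()     # kept in full precision
        dense = weight.masked_fill(outlier_mask, 0.0)      # remaining weights

        # 2) Quantize the remaining weights per group with asymmetric min-max.
        g = dense.reshape(rows, cols // group_size, group_size)
        gmin = g.min(dim=-1, keepdim=True).values
        gmax = g.max(dim=-1, keepdim=True).values
        scale = (gmax - gmin).clamp(min=1e-8) / (2**bits - 1)
        zero = (-gmin / scale).round()
        q = ((g / scale) + zero).round().clamp(0, 2**bits - 1).to(torch.uint8)
        return q, scale, zero, outliers

    def toy_spqr_reconstruct(q, scale, zero, outliers):
        """Dequantize the dense part and add back the sparse outliers."""
        dense = ((q.float() - zero) * scale).reshape(outliers.shape)
        return dense + outliers.to_dense()

    w = torch.randn(64, 64)
    w_hat = toy_spqr_reconstruct(*toy_spqr_split(w))
    print("max reconstruction error:", (w - w_hat).abs().max().item())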

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: torch>=2.0.0 with CUDA support. Transformers version 4.28.dev0 (commit 464d420775) is recommended for reproducibility (a quick environment check is sketched after this list).
  • Resources: Tested on a single A100 80GB GPU. Perplexity evaluation up to LLaMA-65B/Falcon-40B is possible on GPUs with 32GB+ VRAM. Offloading activations can reduce VRAM requirements to 24GB+ for LLaMA-65B and 6GB+ for LLaMA-7B, with a slight performance impact. RAM requirements can be substantial (e.g., ~130GB for LLaMA-65B) to hold uncompressed weights and datasets.
  • Links: Research Paper
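
A quick way to confirm the PyTorch/CUDA prerequisites before downloading any weights is a check along these lines (generic PyTorch, not part of the repository):

    # Generic prerequisite check (not from this repository): verify torch >= 2.0.0
    # with CUDA support and report available GPU memory.
    import torch

    print("torch version:", torch.__version__)
    assert torch.cuda.is_available(), "CUDA-enabled PyTorch is required"
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 2**30:.1f} GiB VRAM")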

Highlighted Details

  • Supports LLaMA, Falcon, and OPT model families.
  • Includes scripts for perplexity benchmarks and zero-shot evaluation using a modified LM Evaluation Harness (a generic perplexity sketch follows this list).
  • Provides an efficient CUDA kernel implementation for SpQR matvec for inference.
  • Offers conversion scripts for legacy SpQR formats to optimized storage and Hugging Face compatibility.
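
For reference, perplexity over a corpus is typically computed by sliding the causal LM over fixed-length token windows and exponentiating the mean per-token negative log-likelihood. The sketch below is a generic Hugging Face Transformers version, not this repository's benchmark script; the model id and input text are placeholders.

    # Generic perplexity sketch (not the repository's evaluation code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    @torch.no_grad()
    def perplexity(model, tokenizer, text, seq_len=2048):
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        nlls, n_tokens = [], 0
        for start in range(0, ids.size(1) - 1, seq_len):
            chunk = ids[:, start:start + seq_len]
            if chunk.size(1) < 2:
                break
            out = model(chunk, labels=chunk)   # HF shifts labels internally
            n = chunk.size(1) - 1              # tokens actually predicted
            nlls.append(out.loss * n)
            n_tokens += n
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

    name = "huggyllama/llama-7b"               # placeholder model id
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
    print("ppl:", perplexity(model, tok, "The quick brown fox jumps over the lazy dog. " * 500))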

Maintenance & Community

No specific community channels or active maintainer information is provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that efficient inference code will be added soon, so inference support is not yet complete. The evaluation script currently supports quantization only for LLaMA and Falcon models. Setup and benchmarking require substantial GPU memory and system RAM.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)

created 1 year ago, updated 2 months ago
1k stars

Top 0.1% on sourcepulse