SpQR by Vahe1994

Weight compression research paper for near-lossless LLM quantization

created 2 years ago
546 stars

Top 59.3% on sourcepulse

1 Expert Loves This Project
Project Summary

SpQR offers near-lossless compression for Large Language Models (LLMs) by employing a novel sparse-quantized representation. This repository is for researchers and engineers seeking to reduce LLM memory footprint and potentially accelerate inference, targeting models like LLaMA, Falcon, and OPT.

How It Works

SpQR achieves compression in two stages: it first identifies outlier weights that would cause particularly large quantization errors and stores them in a higher-precision sparse format, then quantizes the remaining weights to a low bit-width in small groups. Treating the outliers separately is what preserves model accuracy and yields near-lossless compression. The method is detailed in the paper "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression."
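
For intuition, here is a minimal PyTorch sketch of the general idea. It is not code from this repository: the group size, bit width, and outlier threshold are arbitrary example values, and the real SpQR format is considerably more involved (for example, it also compresses the per-group quantization statistics).

    # Illustrative sketch only (not the repository's implementation): split a weight
    # matrix into a sparse full-precision outlier component plus a per-group
    # quantized dense component, in the spirit of SpQR.
    import torch

    def toy_spqr_split(weight, group_size=16, bits=3, outlier_z=3.0):
        """Return (quantized groups, scales, zero-points, sparse outliers)."""
        rows, cols = weight.shape
        assert cols % group_size == 0
        # 1) Mark outliers: weights far from their row's typical magnitude.
        z = (weight - weight.mean(dim=1, keepdim=True)).abs() / (weight.std(dim=1, keepdim=True) + 1e-8)
        outlier_mask = z > outlier_z
        outliers = (weight * outlier_mask).to_sparse()     # kept in full precision
        dense = weight.masked_fill(outlier_mask, 0.0)      # remaining weights

        # 2) Quantize the remaining weights per group with asymmetric min-max.
        g = dense.reshape(rows, cols // group_size, group_size)
        gmin = g.min(dim=-1, keepdim=True).values
        gmax = g.max(dim=-1, keepdim=True).values
        scale = (gmax - gmin).clamp(min=1e-8) / (2**bits - 1)
        zero = (-gmin / scale).round()
        q = ((g / scale) + zero).round().clamp(0, 2**bits - 1).to(torch.uint8)
        return q, scale, zero, outliers

    def toy_spqr_reconstruct(q, scale, zero, outliers):
        """Dequantize the dense part and add back the sparse outliers."""
        dense = ((q.float() - zero) * scale).reshape(outliers.shape)
        return dense + outliers.to_dense()

    w = torch.randn(64, 64)
    w_hat = toy_spqr_reconstruct(*toy_spqr_split(w))
    print("max reconstruction error:", (w - w_hat).abs().max().item())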

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: torch>=2.0.0 with CUDA support. Transformers version 4.28.dev0 (commit 464d420775) is recommended for reproducibility (a quick environment check is sketched after this list).
  • Resources: Tested on a single A100 80GB GPU. Perplexity evaluation up to LLaMA-65B/Falcon-40B is possible on GPUs with 32GB+ VRAM. Offloading activations can reduce VRAM requirements to 24GB+ for LLaMA-65B and 6GB+ for LLaMA-7B, with a slight performance impact. RAM requirements can be substantial (e.g., ~130GB for LLaMA-65B) to hold uncompressed weights and datasets.
  • Links: Research Paper
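
A quick way to confirm the PyTorch/CUDA prerequisites before downloading any weights is a check along these lines (generic PyTorch, not part of the repository):

    # Generic prerequisite check (not from this repository): verify torch >= 2.0.0
    # with CUDA support and report available GPU memory.
    import torch

    print("torch version:", torch.__version__)
    assert torch.cuda.is_available(), "CUDA-enabled PyTorch is required"
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 2**30:.1f} GiB VRAM")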

Highlighted Details

  • Supports LLaMA, Falcon, and OPT model families.
  • Includes scripts for perplexity benchmarks and zero-shot evaluation using a modified LM Evaluation Harness (a generic perplexity sketch follows this list).
  • Provides an efficient CUDA kernel implementation for SpQR matvec for inference.
  • Offers conversion scripts for legacy SpQR formats to optimized storage and Hugging Face compatibility.
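
For reference, perplexity over a corpus is typically computed by sliding the causal LM over fixed-length token windows and exponentiating the mean per-token negative log-likelihood. The sketch below is a generic Hugging Face Transformers version, not this repository's benchmark script; the model id and input text are placeholders.

    # Generic perplexity sketch (not the repository's evaluation code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    @torch.no_grad()
    def perplexity(model, tokenizer, text, seq_len=2048):
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        nlls, n_tokens = [], 0
        for start in range(0, ids.size(1) - 1, seq_len):
            chunk = ids[:, start:start + seq_len]
            if chunk.size(1) < 2:
                break
            out = model(chunk, labels=chunk)   # HF shifts labels internally
            n = chunk.size(1) - 1              # tokens actually predicted
            nlls.append(out.loss * n)
            n_tokens += n
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

    name = "huggyllama/llama-7b"               # placeholder model id
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
    print("ppl:", perplexity(model, tok, "The quick brown fox jumps over the lazy dog. " * 500))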

Maintenance & Community

No specific community channels or active maintainer information is provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that efficient inference code will be added soon, so inference support is not yet complete. The evaluation script currently supports quantization only for LLaMA and Falcon models. Setup and benchmarking require substantial GPU memory and system RAM.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

AQLM by Vahe1994

PyTorch code for LLM compression via Additive Quantization (AQLM)

created 1 year ago, updated 2 months ago
1k stars

Top 0.1% on sourcepulse