Transformer compression via SliceGPT (ICLR'24)
This repository provides SliceGPT, a post-training sparsification technique for transformer models, including large language models (LLMs). It reduces model size and improves inference speed by applying orthogonal transformations and slicing weight matrices, while leaving the network's layer structure intact. The primary audience is researchers and practitioners who want to optimize transformer deployments.
How It Works
SliceGPT applies an orthogonal transformation to each transformer layer, computed from the principal components of the layer's activations on a small calibration set, and then slices away the rows and columns of the weight matrices that correspond to the smallest eigenvalues. Each dense weight matrix is replaced by a smaller dense one, shrinking the embedding dimension and with it memory footprint and latency. The method is designed to maintain model performance while achieving significant compression.
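The core operation can be illustrated with a short, self-contained sketch. This is not the repository's API: the function name `slice_linear_pair`, the matrix shapes, and the omission of LayerNorm conversion, residual-stream handling, and per-layer calibration are simplifying assumptions made only to show the rotate-then-slice idea.

```python
import torch

def slice_linear_pair(W_in, W_out, activations, sparsity=0.25):
    """Rotate a pair of weight matrices with an orthogonal basis computed from
    calibration activations, then drop the least significant directions.

    W_in:        (d, d_ff) weight consuming the hidden state
    W_out:       (d_ff, d) weight producing the next hidden state
    activations: (tokens, d) calibration hidden states
    """
    d = activations.shape[-1]
    k = int(d * (1.0 - sparsity))  # number of hidden dimensions to keep

    # Eigendecomposition of the activation covariance gives an orthogonal Q;
    # reorder its columns so the most significant directions come first.
    cov = activations.T @ activations / activations.shape[0]
    eigvals, Q = torch.linalg.eigh(cov)
    Q = Q[:, torch.argsort(eigvals, descending=True)]

    # Rotation is lossless: (x @ Q) @ (Q.T @ W_in) == x @ W_in because Q is
    # orthogonal. Slicing then keeps only the top-k columns of Q, shrinking
    # the hidden dimension from d to k.
    Q_k = Q[:, :k]                      # (d, k)
    W_in_sliced = Q_k.T @ W_in          # (k, d_ff)
    W_out_sliced = W_out @ Q_k          # (d_ff, k)
    return W_in_sliced, W_out_sliced, Q_k

if __name__ == "__main__":
    d, d_ff, tokens = 768, 3072, 4096
    W_in, W_out = torch.randn(d, d_ff), torch.randn(d_ff, d)
    acts = torch.randn(tokens, d)
    w_in_s, w_out_s, _ = slice_linear_pair(W_in, W_out, acts, sparsity=0.25)
    print(w_in_s.shape, w_out_s.shape)  # (576, 3072) and (3072, 576)
```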
Quick Start & Requirements
Install from the repository root in editable mode with the experiment dependencies:
pip install -e .[experiment]
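After installation, slicing runs are typically launched from the repository's experiment scripts. The invocation below is an illustrative assumption (the script path, model name, and flags may differ in the current version); consult the repository's README for the exact interface.

```
python experiments/run_slicegpt.py \
    --model facebook/opt-125m \
    --sparsity 0.25 \
    --save-dir ./sliced-model \
    --device cuda:0
```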
Highlighted Details
Maintenance & Community
Last recorded activity was about 6 months ago; the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats