ARB-LLM by XIANGLONGYAN

LLM compression via advanced binarization

Created 11 months ago
1,349 stars

Top 29.9% on SourcePulse

View on GitHub
Project Summary

ARB-LLM addresses the high memory and computational demands of Large Language Models (LLMs) by introducing a novel 1-bit post-training quantization (PTQ) technique. Targeting researchers and engineers deploying LLMs, it significantly reduces model size and resource requirements while preserving performance, offering a practical solution for efficient LLM deployment.

How It Works

ARB-LLM employs an Alternating Refined Binarization (ARB) algorithm that progressively updates the binarization parameters, narrowing the distribution gap between binarized and full-precision weights and reducing quantization error. Extensions ARB-X and ARB-RC target specific characteristics of LLM weight distributions, such as column-wise deviation, and a Column-Group Bitmap (CGB) strategy further refines the weight partitioning. Together, these yield better compression and accuracy than prior binarization methods.
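To make the idea concrete, here is a minimal, self-contained sketch of alternating binarization refinement. It illustrates the general technique only, not the repository's actual ARB implementation: each weight row is approximated as alpha * B + mu with B in {-1, +1}, and the scale alpha and shift mu are re-estimated in turn so the reconstruction error shrinks.

```python
import torch

def alternating_binarize(W: torch.Tensor, iters: int = 5):
    """Illustrative alternating refinement of a 1-bit approximation.

    Approximates each row of W as alpha * B + mu with B in {-1, +1};
    alpha (scale) and mu (shift) are alternately re-estimated, which
    does not increase ||W - (alpha * B + mu)||_F^2 at any step.
    NOTE: a simplified sketch, not the ARB-LLM algorithm itself.
    """
    mu = W.mean(dim=1, keepdim=True)                    # initial row-wise shift
    alpha = (W - mu).abs().mean(dim=1, keepdim=True)    # initial row-wise scale
    for _ in range(iters):
        B = torch.sign(W - mu)                          # binarize the shifted weights
        B[B == 0] = 1.0
        mu = (W - alpha * B).mean(dim=1, keepdim=True)  # refine shift given alpha, B
        alpha = ((W - mu) * B).mean(dim=1, keepdim=True)  # least-squares scale given B, mu
    return alpha, B, mu

W = torch.randn(8, 16)
alpha, B, mu = alternating_binarize(W)
rel_err = torch.norm(W - (alpha * B + mu)) / torch.norm(W)
print(f"relative reconstruction error: {rel_err.item():.3f}")
```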

Quick Start & Requirements

Clone the repository and set up a Conda environment:

  • git clone https://github.com/ZHITENGLI/ARB-LLM.git
  • conda create -n arbllm python=3.11
  • conda activate arbllm
  • pip install torch torchvision torchaudio
  • pip install -r requirements.txt

GPU acceleration (CUDA) is required; the example commands target "cuda:0". Official repository: https://github.com/ZHITENGLI/ARB-LLM.git.
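As a quick post-install sanity check, a short snippet along these lines (illustrative only, not part of the repository; it assumes transformers is among the installed dependencies and uses facebook/opt-125m purely as a small stand-in model) confirms that a CUDA device is visible and that an OPT checkpoint loads:

```python
# Illustrative environment check; not an ARB-LLM script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assert torch.cuda.is_available(), "the ARB-LLM example commands expect a CUDA device"
device = "cuda:0"

name = "facebook/opt-125m"  # small stand-in; swap in the model you intend to binarize
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to(device)

inputs = tok("Binarized LLMs reduce memory by", return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```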

Highlighted Details

  • Achieves SOTA performance, outperforming previous binary PTQ methods like BiLLM across OPT model families.
  • ARB-LLM RC surpasses FP16 models of equivalent memory footprint: binarized OPT-13B uses memory comparable to FP16 OPT-2.7B while delivering better performance.
  • Supports binarization for OPT, LLaMA, LLaMA-2, LLaMA-3, and Vicuna model families.
  • Evaluated using perplexity on WikiText2 and zero-shot QA accuracy via lm-evaluation-harness (see the perplexity sketch below). Paper available on arXiv.
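For reference, WikiText2 perplexity for causal LMs is typically computed as a chunked negative log-likelihood along the lines of the generic sketch below (this mirrors the common evaluation recipe, not ARB-LLM's own script; the model name and the 2048-token context length are placeholders):

```python
# Generic WikiText2 perplexity sketch; not ARB-LLM's evaluation code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"  # placeholder; substitute the (binarized) model under test
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to(device).eval()

# Concatenate the test split and tokenize it as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.size(1) - seqlen + 1, seqlen):
        chunk = ids[:, i : i + seqlen]
        loss = model(chunk, labels=chunk).loss  # mean token NLL over the chunk
        nlls.append(loss * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```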

Maintenance & Community

Code released February 16, 2025, indicating recent activity. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README. The project is based on BiLLM.

Licensing & Compatibility

Released under the Apache 2.0 license. Because the implementation builds on BiLLM, BiLLM's license and compatibility terms should be reviewed independently.

Limitations & Caveats

The provided README does not explicitly detail limitations, unsupported platforms, or known bugs. As a post-training quantization method, its performance characteristics may differ from those of quantization-aware training approaches. CUDA is a prerequisite.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1,565 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab
0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 3 years ago · Updated 1 year ago
Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
0.2% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 3 months ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200
0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · Updated 1 year ago