ARB-LLM by XIANGLONGYAN

LLM compression via advanced binarization

Created 11 months ago
1,349 stars

Top 29.9% on SourcePulse

View on GitHub
Project Summary

ARB-LLM addresses the high memory and computational demands of Large Language Models (LLMs) by introducing a novel 1-bit post-training quantization (PTQ) technique. Targeting researchers and engineers deploying LLMs, it significantly reduces model size and resource requirements while preserving performance, offering a practical solution for efficient LLM deployment.

How It Works

ARB-LLM employs an Alternating Refined Binarization (ARB) algorithm that progressively updates the binarization parameters, narrowing the distribution gap between binarized and full-precision weights and reducing quantization error. Extensions ARB-X and ARB-RC target specific characteristics of LLM weight distributions, such as column-wise deviation, and a Column-Group Bitmap (CGB) strategy further refines the weight partitioning. Together, these yield better compression and accuracy than prior binarization methods.
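To make the idea concrete, here is a minimal, self-contained sketch of alternating binarization refinement. It illustrates the general technique only, not the repository's actual ARB implementation: each weight row is approximated as alpha * B + mu with B in {-1, +1}, and the scale alpha and shift mu are re-estimated in turn so the reconstruction error shrinks.

```python
import torch

def alternating_binarize(W: torch.Tensor, iters: int = 5):
    """Illustrative alternating refinement of a 1-bit approximation.

    Approximates each row of W as alpha * B + mu with B in {-1, +1};
    alpha (scale) and mu (shift) are alternately re-estimated, which
    does not increase ||W - (alpha * B + mu)||_F^2 at any step.
    NOTE: a simplified sketch, not the ARB-LLM algorithm itself.
    """
    mu = W.mean(dim=1, keepdim=True)                    # initial row-wise shift
    alpha = (W - mu).abs().mean(dim=1, keepdim=True)    # initial row-wise scale
    for _ in range(iters):
        B = torch.sign(W - mu)                          # binarize the shifted weights
        B[B == 0] = 1.0
        mu = (W - alpha * B).mean(dim=1, keepdim=True)  # refine shift given alpha, B
        alpha = ((W - mu) * B).mean(dim=1, keepdim=True)  # least-squares scale given B, mu
    return alpha, B, mu

W = torch.randn(8, 16)
alpha, B, mu = alternating_binarize(W)
rel_err = torch.norm(W - (alpha * B + mu)) / torch.norm(W)
print(f"relative reconstruction error: {rel_err.item():.3f}")
```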

Quick Start & Requirements

Clone the repository and set up a Conda environment:

  • git clone https://github.com/ZHITENGLI/ARB-LLM.git
  • conda create -n arbllm python=3.11
  • conda activate arbllm
  • pip install torch torchvision torchaudio
  • pip install -r requirements.txt

GPU acceleration (CUDA) is required; the example commands target "cuda:0". Official repository: https://github.com/ZHITENGLI/ARB-LLM.git.
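As a quick post-install sanity check, a short snippet along these lines (illustrative only, not part of the repository; it assumes transformers is among the installed dependencies and uses facebook/opt-125m purely as a small stand-in model) confirms that a CUDA device is visible and that an OPT checkpoint loads:

```python
# Illustrative environment check; not an ARB-LLM script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assert torch.cuda.is_available(), "the ARB-LLM example commands expect a CUDA device"
device = "cuda:0"

name = "facebook/opt-125m"  # small stand-in; swap in the model you intend to binarize
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to(device)

inputs = tok("Binarized LLMs reduce memory by", return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```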

Highlighted Details

  • Achieves SOTA performance, outperforming previous binary PTQ methods like BiLLM across OPT model families.
  • ARB-LLM RC surpasses FP16 models of equivalent memory footprint: binarized OPT-13B uses memory comparable to FP16 OPT-2.7B while delivering better performance.
  • Supports binarization for OPT, LLaMA, LLaMA-2, LLaMA-3, and Vicuna model families.
  • Evaluated using perplexity on WikiText2 and zero-shot QA accuracy via lm-evaluation-harness (see the perplexity sketch below). Paper available on arXiv.
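For reference, WikiText2 perplexity for causal LMs is typically computed as a chunked negative log-likelihood along the lines of the generic sketch below (this mirrors the common evaluation recipe, not ARB-LLM's own script; the model name and the 2048-token context length are placeholders):

```python
# Generic WikiText2 perplexity sketch; not ARB-LLM's evaluation code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"  # placeholder; substitute the (binarized) model under test
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to(device).eval()

# Concatenate the test split and tokenize it as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.size(1) - seqlen + 1, seqlen):
        chunk = ids[:, i : i + seqlen]
        loss = model(chunk, labels=chunk).loss  # mean token NLL over the chunk
        nlls.append(loss * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```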

Maintenance & Community

Code released February 16, 2025, indicating recent activity. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README. The project is based on BiLLM.

Licensing & Compatibility

Released under the Apache 2.0 license. Because the implementation builds on BiLLM, BiLLM's license and compatibility terms should be reviewed independently.

Limitations & Caveats

The provided README does not explicitly detail limitations, unsupported platforms, or known bugs. As a post-training quantization method, its performance characteristics may differ from those of quantization-aware training approaches. CUDA is a prerequisite.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1,565 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 6 more.

gptq by IST-DASLab
0.1% · 2k stars
Code for GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers
Created 3 years ago · Updated 1 year ago
Starred by Yaowei Zheng (author of LLaMA-Factory), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab
0.2% · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 3 months ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 5 more.

GPTQ-for-LLaMa by qwopqwop200
0.0% · 3k stars
4-bit quantization for LLaMA models using GPTQ
Created 2 years ago · Updated 1 year ago