LLM structural pruner for model compression
LLM-Pruner offers structural pruning for large language models, enabling significant compression with minimal performance degradation. It targets researchers and practitioners aiming to reduce the computational footprint of LLMs such as Llama, BLOOM, and Vicuna, facilitating deployment in resource-constrained environments.
How It Works
LLM-Pruner employs a three-stage process: Discovery, Estimation, and Recovery. The Discovery stage analyzes structural dependencies to identify groups of coupled structures, the minimal units that can be removed together. The Estimation stage scores the importance of these units using criteria such as first-order Taylor expansion or L1/L2 weight norms, as sketched below. Finally, the Recovery stage uses efficient post-training on datasets like Alpaca or LaMini-instruction to restore model performance. This approach enables task-agnostic compression and efficient fine-tuning.
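A minimal sketch of the Estimation stage, using a toy linear layer as a stand-in for one prunable structural group; the names toy_loss, taylor_importance, and l2_importance are illustrative, not LLM-Pruner's API:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 8)   # stand-in for one prunable group of output channels
x = torch.randn(4, 16)

# Backpropagate a scalar loss so layer.weight.grad is populated.
toy_loss = layer(x).pow(2).mean()
toy_loss.backward()

# First-order Taylor importance per output channel (row of the weight matrix):
# an estimate of the loss change if that channel's weights were zeroed.
taylor_importance = (layer.weight.grad * layer.weight).abs().sum(dim=1)

# Magnitude-based alternative: L2 norm of each output channel.
l2_importance = layer.weight.norm(p=2, dim=1)

# Channels with the lowest scores are candidates for structural removal.
num_prune = 2
prune_idx = torch.argsort(taylor_importance)[:num_prune]
print("channels to prune:", prune_idx.tolist())

Because whole channels, rather than individual weights, are removed, the pruned model stays dense and needs no sparse kernels at inference time.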
Quick Start & Requirements
pip install -r requirement.txt
Evaluation relies on lm-evaluation-harness. A GPU is recommended for Taylor-based pruning and evaluation. The script/llama_prune.sh script automates downloading models and datasets for a minimal example.
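If the example script saves the pruned network as a pickled checkpoint, reloading might look like the sketch below; the path and the dict layout are assumptions about the script's output format, so verify them against the repository:

import torch

# Hypothetical path and layout: assumes the script pickles the whole model
# object (not just a state_dict), so no architecture rebuild is needed.
ckpt = torch.load("prune_log/llama_prune/pytorch_model.bin", map_location="cpu")
model, tokenizer = ckpt["model"], ckpt["tokenizer"]
model.eval()

# Assumes a Hugging Face-style causal LM interface on the loaded object.
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))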
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project notes that, while pruning is efficient, compressed models can still exhibit issues such as repetitive token generation or nonsensical outputs, indicating room for quality improvement. Manual intervention may be required for certain model architectures to map index concatenations correctly.