LLM pruning research paper implementation
Wanda is a PyTorch implementation for pruning Large Language Models (LLMs) by jointly considering weight magnitudes and activation norms. It targets researchers and practitioners aiming to reduce LLM size and computational cost while maintaining performance, offering a simple yet effective alternative to pure magnitude pruning.
How It Works
Wanda prunes weights on a per-output basis, calculating importance as the product of weight magnitudes and the L2 norm of input activations. This approach aims to identify and remove weights that are less critical to the model's output, potentially leading to better sparsity-accuracy trade-offs compared to methods relying solely on weight magnitudes.
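As a rough illustration (not the repository's code), the per-output Wanda score and pruning mask for a single linear layer could be computed as below; weight, activations, and sparsity are hypothetical names for the layer weight, the calibration inputs, and the target sparsity ratio.

import torch

def wanda_prune_layer(weight, activations, sparsity):
    # weight:      (out_features, in_features) dense weight of a linear layer
    # activations: (num_tokens, in_features) calibration inputs to that layer
    # sparsity:    fraction of weights to zero in each output row, e.g. 0.5
    act_norm = activations.norm(p=2, dim=0)        # per-input-channel L2 norm
    importance = weight.abs() * act_norm           # Wanda score: |W| * ||X||_2
    num_prune = int(weight.shape[1] * sparsity)
    # Compare within each output row and drop the lowest-scoring weights.
    drop = importance.topk(num_prune, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool).scatter_(1, drop, False)
    return weight * mask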
Quick Start & Requirements
Clone the repository, then install it in editable mode:
pip install -e .
Prune LLaMA-7B with Wanda at 50% unstructured sparsity:
python main.py --model decapoda-research/llama-7b-hf --prune_method wanda --sparsity_ratio 0.5 --sparsity_type unstructured --save out/llama_7b/unstructured/wanda/
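The same CLI is expected to accept structured N:M patterns (such as 2:4) via the sparsity type flag while keeping the 50% ratio; the pattern value and output path below are illustrative, so check the repository's README for the exact options:
python main.py --model decapoda-research/llama-7b-hf --prune_method wanda --sparsity_ratio 0.5 --sparsity_type 2:4 --save out/llama_7b/2-4/wanda/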
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The repository is built upon SparseGPT and focuses on LLM pruning; pruning for image classifiers is in a separate directory. Inference speedup for structured sparsity relies on PyTorch >= 2.1 and specific kernels (CUTLASS or CuSPARSELt).
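As a minimal sketch of how a 2:4-pruned layer could be accelerated (not the repository's code), PyTorch >= 2.1 exposes a prototype semi-structured sparse format that dispatches matmuls to CUTLASS or cuSPARSELt kernels; the example assumes a CUDA GPU with fp16 weights and illustrative layer sizes.

import torch
from torch.sparse import to_sparse_semi_structured

# Toy linear layer; semi-structured kernels need fp16/bf16 weights on a CUDA GPU.
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# Enforce a 2:4 pattern: zero the 2 smallest-magnitude weights in every group of 4.
w4 = linear.weight.data.view(linear.out_features, -1, 4)
drop = w4.abs().topk(2, dim=-1, largest=False).indices
w4.scatter_(-1, drop, 0.0)

# Swap the dense weight for the semi-structured representation so that
# subsequent matmuls use the sparse kernels.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
y = linear(x)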