nn_pruning by huggingface

Model pruning tool for efficient inference

created 4 years ago
403 stars

Top 73.0% on sourcepulse

Project Summary

This library provides tools for applying movement pruning to neural networks, specifically focusing on achieving structured sparsity for improved inference speed. It targets researchers and practitioners in NLP and deep learning who need to compress large models like BERT while minimizing accuracy loss, enabling efficient deployment on resource-constrained devices.

How It Works

The library implements "Block Movement Pruning," an extension of movement pruning that produces structured sparsity. Movement pruning learns an importance score for each weight during fine-tuning, based on how the weight moves toward or away from zero; the block variant applies these scores to whole blocks of weights rather than individual ones (for example, entire attention heads or rows and columns of feed-forward matrices). Block-level sparsity is far more amenable to hardware acceleration than unstructured sparsity, because the surviving weights can be repacked into smaller dense matrices. The library explores semi-structured and structured variants, letting users balance sparsity level, accuracy, and inference speed, and performs the pruning during fine-tuning so the network can adapt to the sparsity.
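As a rough illustration of the idea (not the library's own code), each block of weights in a linear layer can be given a learnable importance score that is trained together with the model; blocks whose scores stay below a threshold are masked out, and the threshold is typically raised over fine-tuning so sparsity grows gradually. The sketch below is plain PyTorch under these assumptions; BlockScoredLinear, block_size, and threshold are hypothetical names introduced only for explanation.

    # Illustrative PyTorch sketch of block-level importance scoring.
    # This is NOT the nn_pruning implementation; BlockScoredLinear,
    # block_size, and threshold are hypothetical names.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlockScoredLinear(nn.Module):
        def __init__(self, in_features, out_features, block_size=32):
            super().__init__()
            assert in_features % block_size == 0 and out_features % block_size == 0
            self.linear = nn.Linear(in_features, out_features)
            self.block_size = block_size
            # One learnable importance score per (block_size x block_size) weight block.
            self.scores = nn.Parameter(torch.full(
                (out_features // block_size, in_features // block_size), 0.01))

        def forward(self, x, threshold=0.0):
            # Hard 0/1 decision per block, with a straight-through trick so
            # gradients still reach the scores and "movement" can be learned.
            soft = torch.sigmoid(self.scores)
            block_mask = (self.scores > threshold).float() + soft - soft.detach()
            # Expand each block decision over its block of weights.
            mask = block_mask.repeat_interleave(self.block_size, dim=0)
            mask = mask.repeat_interleave(self.block_size, dim=1)
            return F.linear(x, self.linear.weight * mask, self.linear.bias)

    # During fine-tuning, the threshold (or a target sparsity) is scheduled
    # upward so more and more low-score blocks are zeroed out.
    layer = BlockScoredLinear(768, 768)
    y = layer(torch.randn(4, 768), threshold=0.0)

After training, fully zeroed blocks can be removed outright, leaving smaller dense matrices; that repacking step is where the inference speedup comes from.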

Quick Start & Requirements

  • Install via pip: python -m pip install -U nn_pruning
  • Developer install: git clone https://github.com/huggingface/nn_pruning.git then cd nn_pruning and python -m pip install -e ".[dev]"
  • Run tests with pytest nn_pruning
  • Requires Python and PyTorch; a CUDA-capable GPU is implied for the reported performance benchmarks, though specific CUDA versions are not stated (see the usage sketch after this list).
  • Documentation: https://huggingface.co/docs/nn_pruning/index
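The snippet below is a hedged sketch of compacting an already fine-pruned checkpoint into a smaller dense model for inference. It assumes the optimize_model helper from nn_pruning.inference_model_patcher described in the project's documentation; the exact signature may differ across versions, and "path/to/fine-pruned-checkpoint" is a placeholder, not a real model id.

    # Hedged sketch: compacting a fine-pruned model for inference.
    # Assumes nn_pruning.inference_model_patcher.optimize_model as described
    # in the project docs; verify the exact API against the documentation above.
    from transformers import AutoModelForQuestionAnswering
    from nn_pruning.inference_model_patcher import optimize_model

    # "path/to/fine-pruned-checkpoint" is a placeholder for a model produced
    # by fine-pruning with nn_pruning, not a real Hub id.
    model = AutoModelForQuestionAnswering.from_pretrained("path/to/fine-pruned-checkpoint")

    # Strip the zeroed attention heads and feed-forward rows/columns so the
    # remaining weights form smaller dense matrices that run faster on
    # standard hardware.
    model = optimize_model(model, "dense")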

Highlighted Details

  • Achieves significant speedups (up to 2.8x on BERT-base for SQuAD v1) with minimal accuracy drop (e.g., -0.25 F1 for 65% parameter reduction).
  • Demonstrates that pruning larger models (BERT-large) can yield better results than pruning smaller ones, even at comparable final sizes.
  • Outperforms fine-tuned TinyBERT and DistilBERT in terms of F1 score vs. speedup.
  • Offers flexibility to choose trade-offs between speed and accuracy based on application needs.

Maintenance & Community

  • Developed by Hugging Face.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Given it's a Hugging Face project, it's likely Apache 2.0, but this requires verification.
  • Compatible with Hugging Face Transformers library and PyTorch.

Limitations & Caveats

  • The "structured pruning" method can lead to a significant drop in F1 score.
  • Performance comparisons with MobileBERT are ongoing, and further hyperparameter tuning may be needed.
  • The pytorch_block_sparse CUDA implementation is not yet competitive with dense linear layers for speed.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX):

  • wanda by locuslab: LLM pruning research paper implementation (782 stars; created 2 years ago; updated 11 months ago)