nn_pruning by huggingface

Model pruning tool for efficient inference

created 4 years ago
403 stars

Top 73.0% on sourcepulse

Project Summary

This library provides tools for applying movement pruning to neural networks, specifically focusing on achieving structured sparsity for improved inference speed. It targets researchers and practitioners in NLP and deep learning who need to compress large models like BERT while minimizing accuracy loss, enabling efficient deployment on resource-constrained devices.

How It Works

The library implements "Block Movement Pruning," an extension of movement pruning that produces structured sparsity. Movement pruning learns an importance score for each weight during fine-tuning, based on how the weight moves toward or away from zero; the block variant applies these scores to whole blocks of weights rather than individual ones (for example, entire attention heads or rows and columns of feed-forward matrices). Block-level sparsity is far more amenable to hardware acceleration than unstructured sparsity, because the surviving weights can be repacked into smaller dense matrices. The library explores semi-structured and structured variants, letting users balance sparsity level, accuracy, and inference speed, and performs the pruning during fine-tuning so the network can adapt to the sparsity.
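As a rough illustration of the idea (not the library's own code), each block of weights in a linear layer can be given a learnable importance score that is trained together with the model; blocks whose scores stay below a threshold are masked out, and the threshold is typically raised over fine-tuning so sparsity grows gradually. The sketch below is plain PyTorch under these assumptions; BlockScoredLinear, block_size, and threshold are hypothetical names introduced only for explanation.

    # Illustrative PyTorch sketch of block-level importance scoring.
    # This is NOT the nn_pruning implementation; BlockScoredLinear,
    # block_size, and threshold are hypothetical names.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlockScoredLinear(nn.Module):
        def __init__(self, in_features, out_features, block_size=32):
            super().__init__()
            assert in_features % block_size == 0 and out_features % block_size == 0
            self.linear = nn.Linear(in_features, out_features)
            self.block_size = block_size
            # One learnable importance score per (block_size x block_size) weight block.
            self.scores = nn.Parameter(torch.full(
                (out_features // block_size, in_features // block_size), 0.01))

        def forward(self, x, threshold=0.0):
            # Hard 0/1 decision per block, with a straight-through trick so
            # gradients still reach the scores and "movement" can be learned.
            soft = torch.sigmoid(self.scores)
            block_mask = (self.scores > threshold).float() + soft - soft.detach()
            # Expand each block decision over its block of weights.
            mask = block_mask.repeat_interleave(self.block_size, dim=0)
            mask = mask.repeat_interleave(self.block_size, dim=1)
            return F.linear(x, self.linear.weight * mask, self.linear.bias)

    # During fine-tuning, the threshold (or a target sparsity) is scheduled
    # upward so more and more low-score blocks are zeroed out.
    layer = BlockScoredLinear(768, 768)
    y = layer(torch.randn(4, 768), threshold=0.0)

After training, fully zeroed blocks can be removed outright, leaving smaller dense matrices; that repacking step is where the inference speedup comes from.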

Quick Start & Requirements

  • Install via pip: python -m pip install -U nn_pruning
  • Developer install: git clone https://github.com/huggingface/nn_pruning.git then cd nn_pruning and python -m pip install -e ".[dev]"
  • Run tests with pytest nn_pruning
  • Requires Python and PyTorch; a CUDA-capable GPU is implied for the reported performance benchmarks, though specific CUDA versions are not stated (see the usage sketch after this list).
  • Documentation: https://huggingface.co/docs/nn_pruning/index
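The snippet below is a hedged sketch of compacting an already fine-pruned checkpoint into a smaller dense model for inference. It assumes the optimize_model helper from nn_pruning.inference_model_patcher described in the project's documentation; the exact signature may differ across versions, and "path/to/fine-pruned-checkpoint" is a placeholder, not a real model id.

    # Hedged sketch: compacting a fine-pruned model for inference.
    # Assumes nn_pruning.inference_model_patcher.optimize_model as described
    # in the project docs; verify the exact API against the documentation above.
    from transformers import AutoModelForQuestionAnswering
    from nn_pruning.inference_model_patcher import optimize_model

    # "path/to/fine-pruned-checkpoint" is a placeholder for a model produced
    # by fine-pruning with nn_pruning, not a real Hub id.
    model = AutoModelForQuestionAnswering.from_pretrained("path/to/fine-pruned-checkpoint")

    # Strip the zeroed attention heads and feed-forward rows/columns so the
    # remaining weights form smaller dense matrices that run faster on
    # standard hardware.
    model = optimize_model(model, "dense")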

Highlighted Details

  • Achieves significant speedups (up to 2.8x on BERT-base for SQuAD v1) with minimal accuracy drop (e.g., -0.25 F1 for 65% parameter reduction).
  • Demonstrates that pruning larger models (BERT-large) can yield better results than pruning smaller ones, even at comparable final sizes.
  • Outperforms fine-tuned TinyBERT and DistilBERT in terms of F1 score vs. speedup.
  • Offers flexibility to choose trade-offs between speed and accuracy based on application needs.

Maintenance & Community

  • Developed by Hugging Face.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Given it's a Hugging Face project, it's likely Apache 2.0, but this requires verification.
  • Compatible with Hugging Face Transformers library and PyTorch.

Limitations & Caveats

  • The "structured pruning" method can lead to a significant drop in F1 score.
  • Performance comparisons with MobileBERT are ongoing, and further hyperparameter tuning may be needed.
  • The pytorch_block_sparse CUDA implementation is not yet competitive with dense linear layers for speed.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 90 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX):

  • wanda by locuslab: LLM pruning research paper implementation (782 stars; created 2 years ago; updated 11 months ago)