NLU optimization recipes for transformer models
FastFormers provides methods and recipes for highly efficient Transformer model inference for Natural Language Understanding (NLU) tasks. It targets researchers and engineers seeking significant speed-ups on CPU and GPU, demonstrating up to 233x speed improvement on CPU for multi-head self-attention architectures.
How It Works
The project leverages techniques like knowledge distillation, structured pruning (reducing heads and FFN dimensions), and 8-bit integer quantization via ONNX Runtime for CPU optimization. For GPU, it supports 16-bit floating-point precision. The core approach involves creating smaller, faster student models from larger teacher models, often with modifications to activation functions and architectural elements.
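A minimal sketch of the knowledge-distillation objective described above, assuming a PyTorch setup. The function name, temperature `T`, and blend weight `alpha` are illustrative placeholders, not FastFormers' actual code, which combines distillation with pruning and quantization:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.9):
    """Soft-target distillation loss: KL divergence between temperature-scaled
    teacher and student distributions, blended with hard-label cross-entropy."""
    # Soft targets: teacher probabilities softened at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales the soft-target gradients to match the hard-label term
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

The student is trained against this loss so that a smaller, pruned architecture mimics the teacher's output distribution rather than only the hard labels.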
Quick Start & Requirements
```bash
pip install onnxruntime==1.8.0 --user --upgrade --no-deps --force-reinstall
pip uninstall transformers -y
git clone https://github.com/microsoft/fastformers
cd fastformers
pip install .
```
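For the CPU path, the 8-bit integer quantization runs through ONNX Runtime. A minimal sketch using onnxruntime's dynamic quantization API; the model file names are placeholder paths, not scripts shipped with the repo:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of an exported ONNX model to INT8;
# activations are quantized dynamically at inference time.
quantize_dynamic(
    model_input="model.onnx",        # placeholder: exported student model
    model_output="model-int8.onnx",  # placeholder: quantized output path
    weight_type=QuantType.QInt8,
)
```

The quantized model can then be loaded with `onnxruntime.InferenceSession` for CPU inference.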
Highlighted Details
Maintenance & Community
Last commit 4 months ago; the repository is marked inactive.
Licensing & Compatibility
Limitations & Caveats
Not compatible with the standard transformers package due to customized versions; the stock installation must be uninstalled before installing FastFormers (see Quick Start).