fastformers by microsoft

NLU optimization recipes for transformer models

created 5 years ago
705 stars

Top 49.5% on sourcepulse

Project Summary

FastFormers provides methods and recipes for highly efficient Transformer model inference for Natural Language Understanding (NLU) tasks. It targets researchers and engineers seeking significant speed-ups on CPU and GPU, demonstrating up to 233x speed improvement on CPU for multi-head self-attention architectures.

How It Works

The project leverages techniques like knowledge distillation, structured pruning (reducing heads and FFN dimensions), and 8-bit integer quantization via ONNX Runtime for CPU optimization. For GPU, it supports 16-bit floating-point precision. The core approach involves creating smaller, faster student models from larger teacher models, often with modifications to activation functions and architectural elements.
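
As a rough sketch of the CPU quantization step only (not the project's full recipe), ONNX Runtime's dynamic quantization API converts the weights of an already-exported ONNX model to 8-bit integers. The file names below are placeholders, not files shipped by FastFormers:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert the weights of an exported ONNX model to signed 8-bit integers.
# "model.onnx" and "model-int8.onnx" are placeholder paths for illustration.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,
)
```

The quantized model can then be served with an ONNX Runtime session on a CPU with AVX2/AVX512 support; the GPU path instead casts the PyTorch model to 16-bit floats.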

Quick Start & Requirements

  • Installation: (1) pip install onnxruntime==1.8.0 --user --upgrade --no-deps --force-reinstall; (2) pip uninstall transformers -y; (3) git clone https://github.com/microsoft/fastformers; (4) cd fastformers; (5) pip install .
  • Prerequisites: Linux OS, Python 3.6/3.7, PyTorch 1.5.0+, ONNX Runtime 1.8.0+. CPU requires AVX2/AVX512 (AVX512 recommended for full speed). GPU requires Volta or later for 16-bit float support.
  • Demo: Requires downloading the SuperGLUE dataset and demo model files (a minimal inference sketch follows this list).
  • Docs: FastFormers Paper
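
The demo scripts themselves are not reproduced here. As a rough illustration of running a distilled classification model, a standard Hugging Face-style call looks like the sketch below; note that FastFormers ships its own customized transformers fork, so the actual entry points may differ, and the model path is a placeholder:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path; in practice this would point at one of the downloaded demo models.
model_dir = "path/to/fastformers-demo-model"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).eval()

inputs = tokenizer("FastFormers speeds up NLU inference.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```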

Highlighted Details

  • Claims up to 233.87x speed-up on CPU for Transformer architectures.
  • Supports knowledge distillation, structured pruning, and 8-bit quantization (a distillation-loss sketch follows this list).
  • Integrates with Hugging Face Transformers and ONNX Runtime.
  • Reproducible results from the FastFormers paper are available.
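
As background on the distillation bullet above, a typical soft-target distillation loss blends a temperature-scaled KL divergence against the teacher's logits with cross-entropy on the gold labels. This is a generic sketch, not the exact objective used in the FastFormers paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher) with hard-label cross-entropy."""
    # Temperature-scaled distributions; multiplying by T^2 keeps gradient
    # magnitudes comparable to the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Tiny usage example with random logits for a 3-class task.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```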

Maintenance & Community

  • Developed in collaboration with the Hugging Face and ONNX Runtime teams.
  • Adopted the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

  • Currently supports only Linux operating systems.
  • Requires uninstalling any existing transformers package, since FastFormers ships a customized version.
  • GPU 16-bit float optimization requires specific hardware (Volta or newer); a quick capability check is sketched below.
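
A quick way to check whether a machine meets these hardware requirements; this sketch assumes a Linux host (it reads /proc/cpuinfo) and an installed PyTorch:

```python
import torch

# CPU: check for AVX2 / AVX512 instruction flags (Linux only).
with open("/proc/cpuinfo") as f:
    flags = f.read()
print("AVX2:", "avx2" in flags)
print("AVX512:", "avx512f" in flags)

# GPU: Volta and newer GPUs report CUDA compute capability major >= 7,
# which is what the 16-bit float path needs.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("FP16-capable GPU (Volta+):", major >= 7)
else:
    print("No CUDA GPU detected")
```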

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

1 star in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago, updated 15 hours ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2% · 40k stars
Deep learning optimization library for distributed training and inference
created 5 years ago, updated 1 day ago