TurboTransformers by Tencent

Transformer inference runtime for CPU and GPU

Created 5 years ago
1,532 stars

Top 27.0% on SourcePulse

View on GitHub
Summary

TurboTransformers is a high-performance runtime for accelerating transformer inference on CPU and GPU. It targets engineers and researchers who need to deploy models such as BERT and GPT2 efficiently, offering significant speedups with simplified integration.

How It Works

This runtime integrates as a PyTorch plugin, delivering end-to-end acceleration with minimal code changes. Its "Smart Batching" feature optimizes inference for variable-length requests by minimizing zero-padding overhead. Combined with optimized kernels, this design delivers strong CPU/GPU performance and supports dynamic batch sizes and sequence lengths.
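
In practice, adoption usually amounts to converting a trained PyTorch model and then calling it as before. The following is a minimal sketch based on the project's Python examples; the class and method names (e.g., turbo_transformers.BertModel.from_torch) are assumed from those examples and may differ between versions:

    import torch
    import transformers
    import turbo_transformers

    # Load a stock HuggingFace BERT and switch it to inference mode.
    model = transformers.BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    # Convert the PyTorch weights into a TurboTransformers model in one call.
    tt_model = turbo_transformers.BertModel.from_torch(model)

    # Batch size and sequence length can vary from request to request.
    input_ids = torch.randint(low=0, high=model.config.vocab_size,
                              size=(2, 40), dtype=torch.long)
    with torch.no_grad():
        sequence_output, pooled_output = tt_model(input_ids)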

Quick Start & Requirements

Installation primarily uses Docker. Build CPU/GPU Docker images via the provided scripts, or pull pre-built images from Docker Hub (thufeifeibear/turbo_transformers_cpu:latest, thufeifeibear/turbo_transformers_gpu:latest). Inside the container, compile the library and install the Python package. The build scripts pin specific OS and dependency versions (PyTorch, CUDA) that may need adjustment for your environment. Examples live in ./example/python and ./example/cpp.
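
For example, pulling a pre-built image and opening a shell in it could look like this (the image names are quoted from above; the run command is illustrative, and the GPU image additionally requires the NVIDIA container runtime):

    # Pull the pre-built CPU image from Docker Hub.
    docker pull thufeifeibear/turbo_transformers_cpu:latest

    # Or the GPU image.
    docker pull thufeifeibear/turbo_transformers_gpu:latest

    # Open an interactive shell; compile and install the package inside.
    docker run -it --rm thufeifeibear/turbo_transformers_cpu:latest /bin/bash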

Highlighted Details

  • Performance: Claims the fastest inference on both CPU and GPU versus PyTorch JIT, TensorRT, and ONNX Runtime; real-world Tencent deployments achieved 1.88x-13.6x acceleration.
  • Supported Models: BERT, ALBERT, RoBERTa, Transformer Decoder, GPT2.
  • Smart Batching: Minimizes zero-padding waste for variable-length inputs (see the sketch after this list).
  • Usability: Python and C++ APIs for easy integration.
  • Tensor Core Support: Optional FP16 acceleration on GPUs via recompilation.
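
To make the zero-padding point concrete, this small self-contained Python sketch (not the library's API; the request lengths are made up) computes how much compute a conventional pad-to-max batch wastes on variable-length requests:

    # Hypothetical request lengths in tokens.
    lengths = [12, 87, 33, 64, 9, 120, 45]

    max_len = max(lengths)
    padded_tokens = max_len * len(lengths)  # tokens processed when padding to the max
    useful_tokens = sum(lengths)            # tokens that actually carry content

    waste = 1 - useful_tokens / padded_tokens
    print(f"pad-to-max processes {padded_tokens} tokens, "
          f"{useful_tokens} useful -> {waste:.0%} wasted on padding")

Smart Batching aims to recover that wasted fraction by avoiding computation on padded positions.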

Maintenance & Community

Open-sourced by Tencent's WeChat AI team, the project has shipped updates such as Smart Batching (v0.6.0). Community support runs through a QQ group (1109315167) and WeChat. Future plans include low-precision model support.

Licensing & Compatibility

BSD 3-Clause License; generally permits commercial use and closed-source integration.

Limitations & Caveats

Numerical output may differ slightly from PyTorch due to an approximate GELU. MKL performance can be suboptimal on PyTorch 1.5.0 (1.1.0 is recommended). Installing onnxruntime-cpu==1.4.0 and onnxruntime-gpu==1.3.0 concurrently is unsupported. Building from source requires careful attention to OS and dependency versions.
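
The GELU discrepancy is easy to reproduce in isolation. The sketch below compares the exact erf-based GELU with the widely used tanh approximation; the summary does not say which approximation TurboTransformers uses, so the tanh variant is shown only as a representative example:

    import math

    def gelu_exact(x: float) -> float:
        # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        # Common tanh approximation of GELU.
        return 0.5 * x * (1.0 + math.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    for x in (-2.0, -0.5, 0.5, 2.0):
        diff = abs(gelu_exact(x) - gelu_tanh(x))
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
              f"approx={gelu_tanh(x):+.6f}  diff={diff:.1e}")

Per-activation differences are tiny (on the order of 1e-4), but they can accumulate into small end-to-end output deviations across many layers.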

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Collaborative benchmark for probing and extrapolating LLM capabilities
Top 0.1% · 3k stars · Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Unified text-to-text transformer for NLP research
Top 0.1% · 6k stars · Created 6 years ago · Updated 5 months ago