Transformer inference runtime for CPU and GPU
Summary
TurboTransformers is a high-performance runtime for accelerating transformer inference on CPU and GPU. It targets engineers and researchers who need efficient deployment of models such as BERT and GPT-2, offering significant speedups and simplified integration.
How It Works
The runtime integrates as a PyTorch plugin, delivering end-to-end acceleration with minimal code changes. Its "Smart Batching" feature optimizes inference for variable-length requests by minimizing zero-padding overhead. Combined with optimized kernels, this delivers strong CPU/GPU performance while supporting dynamic batch sizes and sequence lengths.
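The plugin-style integration typically looks like the sketch below: a pretrained PyTorch BERT is converted in one call and then used for inference with ordinary tensors. This is a minimal sketch modeled on the project's Python examples; names such as turbo_transformers.BertModel.from_torch and the exact return format may differ between versions.

```python
# Minimal sketch of plugin-style usage (names follow the project's Python
# examples; exact API and return format may differ by version).
import torch
import transformers
import turbo_transformers

# Standard Hugging Face BERT as the PyTorch baseline.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()

# One-call conversion to the TurboTransformers runtime.
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

# Dynamic batch size and sequence length: shapes are chosen per request.
input_ids = torch.randint(0, torch_model.config.vocab_size, (2, 40), dtype=torch.long)
with torch.no_grad():
    outputs = turbo_model(input_ids)  # typically (sequence_output, pooled_output)
```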
Quick Start & Requirements
Installation primarily uses Docker. Build CPU/GPU Docker images via the provided scripts, or pull pre-built images from Docker Hub (thufeifeibear/turbo_transformers_cpu:latest, thufeifeibear/turbo_transformers_gpu:latest). Inside Docker, compile the library and install the Python package. The build scripts have specific OS and dependency version requirements (PyTorch, CUDA) that may need adjustment. Examples are in ./example/python and ./example/cpp.
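Before compiling inside the container, it can help to confirm which PyTorch/CUDA combination the image actually provides, since the build scripts pin specific versions. A small check along these lines (the printed values are informational only; the exact versions the scripts require are not reproduced here):

```python
# Sanity check of the build environment inside the Docker container.
# This only reports what PyTorch sees; it does not enforce any version.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA toolkit (as seen by PyTorch):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```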
Highlighted Details
Maintenance & Community
Open-sourced by WeChat AI, the project has seen updates such as smart batching (v0.6.0). Community support is available via a QQ group (1109315167) and WeChat. Future plans include support for low-precision models.
Licensing & Compatibility
BSD 3-Clause License, which generally permits commercial use and closed-source integration.
Limitations & Caveats
Numerical output may differ slightly from PyTorch due to an approximate GELU implementation. MKL performance can be suboptimal on PyTorch 1.5.0 (1.1.0 recommended). Installing onnxruntime-cpu==1.4.0 and onnxruntime-gpu==1.3.0 concurrently is unsupported. Building from source requires careful attention to OS and dependency versions.
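To get a feel for the scale of the GELU discrepancy, one can compare PyTorch's exact (erf-based) GELU with the common tanh approximation. Whether TurboTransformers uses precisely this approximation is an assumption, and the approximate= argument needs a newer PyTorch than the versions discussed above; the snippet only illustrates that the two forms differ by a small amount, so outputs will not match bit-for-bit.

```python
# Compare exact (erf-based) GELU with the common tanh approximation.
# Assumption: TurboTransformers' approximate GELU resembles the tanh form;
# requires a recent PyTorch (the approximate= argument is not in 1.1.0/1.5.0).
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=10001)
exact = F.gelu(x)                       # erf-based GELU
approx = F.gelu(x, approximate="tanh")  # tanh approximation
print("max abs difference:", (exact - approx).abs().max().item())
```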