TurboTransformers by Tencent

Transformer inference runtime for CPU and GPU

Created 5 years ago
1,532 stars

Top 27.0% on SourcePulse

View on GitHub
Summary

TurboTransformers is a high-performance runtime for accelerating transformer inference on CPU and GPU. It targets engineers and researchers who need to deploy models such as BERT and GPT2 efficiently, offering significant speedups with simplified integration.

How It Works

This runtime integrates as a PyTorch plugin, delivering end-to-end acceleration with minimal code changes. Its "Smart Batching" feature optimizes inference for variable-length requests by minimizing zero-padding overhead. Combined with optimized kernels, this design delivers strong CPU/GPU performance and supports dynamic batch sizes and sequence lengths.
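
In practice, adoption usually amounts to converting a trained PyTorch model and then calling it as before. The following is a minimal sketch based on the project's Python examples; the class and method names (e.g., turbo_transformers.BertModel.from_torch) are assumed from those examples and may differ between versions:

    import torch
    import transformers
    import turbo_transformers

    # Load a stock HuggingFace BERT and switch it to inference mode.
    model = transformers.BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    # Convert the PyTorch weights into a TurboTransformers model in one call.
    tt_model = turbo_transformers.BertModel.from_torch(model)

    # Batch size and sequence length can vary from request to request.
    input_ids = torch.randint(low=0, high=model.config.vocab_size,
                              size=(2, 40), dtype=torch.long)
    with torch.no_grad():
        sequence_output, pooled_output = tt_model(input_ids)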

Quick Start & Requirements

Installation primarily uses Docker. Build CPU/GPU Docker images via the provided scripts, or pull pre-built images from Docker Hub (thufeifeibear/turbo_transformers_cpu:latest, thufeifeibear/turbo_transformers_gpu:latest). Inside the container, compile the library and install the Python package. The build scripts pin specific OS and dependency versions (PyTorch, CUDA) that may need adjustment for your environment. Examples live in ./example/python and ./example/cpp.
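
For example, pulling a pre-built image and opening a shell in it could look like this (the image names are quoted from above; the run command is illustrative, and the GPU image additionally requires the NVIDIA container runtime):

    # Pull the pre-built CPU image from Docker Hub.
    docker pull thufeifeibear/turbo_transformers_cpu:latest

    # Or the GPU image.
    docker pull thufeifeibear/turbo_transformers_gpu:latest

    # Open an interactive shell; compile and install the package inside.
    docker run -it --rm thufeifeibear/turbo_transformers_cpu:latest /bin/bash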

Highlighted Details

  • Performance: Claims the fastest inference on both CPU and GPU versus PyTorch JIT, TensorRT, and ONNX Runtime; real-world Tencent deployments achieved 1.88x-13.6x acceleration.
  • Supported Models: BERT, ALBERT, RoBERTa, Transformer Decoder, GPT2.
  • Smart Batching: Minimizes zero-padding waste for variable-length inputs (see the sketch after this list).
  • Usability: Python and C++ APIs for easy integration.
  • Tensor Core Support: Optional FP16 acceleration on GPUs via recompilation.
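
To make the zero-padding point concrete, this small self-contained Python sketch (not the library's API; the request lengths are made up) computes how much compute a conventional pad-to-max batch wastes on variable-length requests:

    # Hypothetical request lengths in tokens.
    lengths = [12, 87, 33, 64, 9, 120, 45]

    max_len = max(lengths)
    padded_tokens = max_len * len(lengths)  # tokens processed when padding to the max
    useful_tokens = sum(lengths)            # tokens that actually carry content

    waste = 1 - useful_tokens / padded_tokens
    print(f"pad-to-max processes {padded_tokens} tokens, "
          f"{useful_tokens} useful -> {waste:.0%} wasted on padding")

Smart Batching aims to recover that wasted fraction by avoiding computation on padded positions.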

Maintenance & Community

Open-sourced by Tencent's WeChat AI team, the project has shipped updates such as Smart Batching (v0.6.0). Community support runs through a QQ group (1109315167) and WeChat. Future plans include low-precision model support.

Licensing & Compatibility

BSD 3-Clause License; generally permits commercial use and closed-source integration.

Limitations & Caveats

Numerical output may differ slightly from PyTorch due to an approximate GELU. MKL performance can be suboptimal on PyTorch 1.5.0 (1.1.0 is recommended). Installing onnxruntime-cpu==1.4.0 and onnxruntime-gpu==1.3.0 concurrently is unsupported. Building from source requires careful attention to OS and dependency versions.
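
The GELU discrepancy is easy to reproduce in isolation. The sketch below compares the exact erf-based GELU with the widely used tanh approximation; the summary does not say which approximation TurboTransformers uses, so the tanh variant is shown only as a representative example:

    import math

    def gelu_exact(x: float) -> float:
        # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        # Common tanh approximation of GELU.
        return 0.5 * x * (1.0 + math.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    for x in (-2.0, -0.5, 0.5, 2.0):
        diff = abs(gelu_exact(x) - gelu_tanh(x))
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  "
              f"approx={gelu_tanh(x):+.6f}  diff={diff:.1e}")

Per-activation differences are tiny (on the order of 1e-4), but they can accumulate into small end-to-end output deviations across many layers.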

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Collaborative benchmark for probing and extrapolating LLM capabilities
Top 0.1% · 3k stars · Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Unified text-to-text transformer for NLP research
Top 0.1% · 6k stars · Created 6 years ago · Updated 5 months ago