TurboTransformers by Tencent

Transformer inference runtime for CPU and GPU

Created 5 years ago
1,534 stars

Top 26.9% on SourcePulse

View on GitHub
Project Summary

Summary

TurboTransformers is a high-performance runtime for accelerating transformer inference on CPU and GPU. It targets engineers and researchers who need to deploy models such as BERT and GPT2 efficiently, offering production-proven speedups (1.88x-13.6x in Tencent services) and simple integration.

How It Works

The runtime integrates as a PyTorch plugin, delivering end-to-end acceleration with minimal code changes; a short sketch follows. Its "Smart Batching" feature optimizes inference for variable-length requests by minimizing zero-padding overhead. Combined with optimized kernels, this design delivers strong CPU/GPU performance and supports dynamic batch sizes and sequence lengths.
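
For illustration, a minimal sketch of the plugin pattern, modeled on the project's Python examples; the model name, the from_torch conversion call, and the output layout are assumptions to verify against your installed version:

```python
import torch
import transformers
import turbo_transformers

# Load a stock Hugging Face BERT and convert it once into a Turbo runtime model.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

# Dynamic shapes: the same converted model serves requests of varying
# batch size and sequence length without re-conversion.
for batch_size, seq_len in [(1, 16), (4, 40), (2, 128)]:
    input_ids = torch.randint(0, torch_model.config.vocab_size,
                              (batch_size, seq_len), dtype=torch.long)
    outputs = turbo_model(input_ids)  # layout mirrors the transformers output
```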

Quick Start & Requirements

Installation primarily uses Docker: build the CPU or GPU image with the provided scripts, or pull a pre-built image from Docker Hub (thufeifeibear/turbo_transformers_cpu:latest, thufeifeibear/turbo_transformers_gpu:latest). Inside the container, compile the library and install the Python package. The build scripts pin specific OS and dependency versions (PyTorch, CUDA) that may need adjustment for your environment. Examples live in ./example/python and ./example/cpp.
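
Once inside the container, a quick smoke test confirms the package is importable (a sketch; set_num_threads follows the pattern used in the project's example scripts and should be checked against your version):

```python
import turbo_transformers

# Pin the number of threads used by the CPU kernels before running inference.
turbo_transformers.set_num_threads(4)
print("turbo_transformers imported successfully")
```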

Highlighted Details

  • Performance: The project reports the fastest inference on both CPU and GPU compared with PyTorch JIT, TensorRT, and ONNX Runtime; real-world Tencent deployments achieved 1.88x-13.6x acceleration.
  • Supported Models: BERT, ALBERT, RoBERTa, Transformer Decoder, GPT2.
  • Smart Batching: Minimizes zero-padding waste for variable-length inputs.
  • Usability: Python/C++ APIs for easy integration.
  • Tensor Core Support: Optional FP16 acceleration on GPUs via recompilation.

Maintenance & Community

Open-sourced by Tencent's WeChat AI team, the project has shipped updates such as Smart Batching (v0.6.0). Community support is available via QQ group (1109315167) and WeChat. Future plans include low-precision model support.

Licensing & Compatibility

BSD 3-Clause License, which generally permits commercial use and closed-source integration.

Limitations & Caveats

Numerical output may differ slightly from PyTorch because GELU is computed with an approximation. MKL performance can be suboptimal under PyTorch 1.5.0 (PyTorch 1.1.0 is recommended). onnxruntime-cpu==1.4.0 and onnxruntime-gpu==1.3.0 cannot be used together in the same environment. Building from source requires careful attention to OS and dependency versions.
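
Given the approximate GELU, a loose-tolerance comparison against the PyTorch reference makes a reasonable acceptance test. The sketch below assumes the from_torch conversion shown earlier; the tolerances are illustrative, not project-specified:

```python
import torch
import transformers
import turbo_transformers

torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)

input_ids = torch.randint(0, torch_model.config.vocab_size, (2, 32),
                          dtype=torch.long)
with torch.no_grad():
    reference = torch_model(input_ids)[0]  # last hidden states
candidate = turbo_model(input_ids)[0]

# Expect small elementwise differences caused by the GELU approximation.
print("max abs diff:", torch.max(torch.abs(reference - candidate)).item())
print("close enough:", torch.allclose(reference, candidate, atol=1e-2, rtol=1e-3))
```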

Health Check

Last Commit: 4 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 2 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wei-Lin Chiang (Cofounder of LMArena), and 13 more.

Explore Similar Projects

awesome-tensor-compilers by merrymercy (3k stars; 0.4%)
Curated list of tensor compiler projects and papers. Created 5 years ago; updated 1 year ago.
Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google (3k stars; 0.2%)
Collaborative benchmark for probing and extrapolating LLM capabilities. Created 4 years ago; updated 1 year ago.
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 14 more.

simpletransformers by ThilinaRajapakse (4k stars; 0.0%)
Rapid NLP task implementation. Created 6 years ago; updated 3 months ago.
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research (6k stars; 0.1%)
Unified text-to-text transformer for NLP research. Created 6 years ago; updated 3 weeks ago.
Starred by Vaibhav Nivargi (Cofounder of Moveworks), Chuan Li (Chief Scientific Officer at Lambda), and 5 more.

awesome-mlops by visenger (13k stars; 0.1%)
Curated MLOps knowledge hub. Created 5 years ago; updated 1 year ago.