bytedance/ByteTransformer: High-performance BERT transformer inference on NVIDIA GPUs
ByteTransformer is a high-performance inference library for BERT-like transformers, targeting developers who want to optimize inference serving on NVIDIA GPUs. It offers architecture-aware optimizations for padding-free BERT routines, delivering higher throughput and lower latency than standard implementations.
How It Works
This library provides both Python and C++ APIs, featuring a PyTorch plugin for easy integration. It implements end-to-end optimizations across key BERT components like QKV encoding, softmax, feed-forward networks, activation, layernorm, and multi-head attention, specifically targeting padding-free execution for efficiency.
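The padding-free idea above can be illustrated with a small sketch. This is a conceptual demo, not the ByteTransformer API: instead of padding every sequence in a batch to the maximum length, valid tokens are packed into one contiguous buffer with per-sequence offsets, so kernels never process padding tokens. All names here are illustrative.

```python
import numpy as np

# Three variable-length sequences (lengths 3, 5, 1).
seqs = [np.arange(1.0, 4.0), np.arange(1.0, 6.0), np.arange(1.0, 2.0)]

# Padded layout: batch x max_len, with a mask marking real tokens.
max_len = max(len(s) for s in seqs)
padded = np.zeros((len(seqs), max_len))
mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, s in enumerate(seqs):
    padded[i, : len(s)] = s
    mask[i, : len(s)] = True

# Packed (padding-free) layout: one flat buffer plus cumulative offsets,
# similar in spirit to the offset arrays used by padding-free kernels.
packed = np.concatenate(seqs)
offsets = np.cumsum([0] + [len(s) for s in seqs])  # [0, 3, 8, 9]

# Any per-token op now touches 9 elements instead of 3 * 5 = 15.
padded_out = np.where(mask, padded * 2.0, 0.0)
packed_out = packed * 2.0

# Unpack and check: identical results, with no work spent on padding.
for i in range(len(seqs)):
    assert np.allclose(padded_out[i, mask[i]],
                       packed_out[offsets[i]:offsets[i + 1]])
print(packed.size, padded.size)  # 9 15
```

The saving grows with length variance in the batch: the padded layout does work proportional to `batch * max_len`, while the packed layout does work proportional to the total number of real tokens.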
Quick Start & Requirements
Build with CMake and make:
git submodule update --init
mkdir build && cd build
cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" ..
make
A benchmark script, benchmark/bert_bench.sh, is provided.
Maintenance & Community
The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.
Licensing & Compatibility
The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README.
Limitations & Caveats
Currently, only the standard BERT transformer encoder architecture is supported within this repository.
Activity: last updated about 1 year ago; the repository appears inactive.