bytedance/ByteTransformer: High-performance BERT transformer inference on NVIDIA GPUs
ByteTransformer is a high-performance inference library for BERT-like transformers, targeting developers who want to optimize inference serving on NVIDIA GPUs. It offers architecture-aware optimizations for padding-free BERT routines, delivering higher throughput and lower latency than standard implementations.
How It Works
This library provides both Python and C++ APIs, featuring a PyTorch plugin for easy integration. It implements end-to-end optimizations across key BERT components like QKV encoding, softmax, feed-forward networks, activation, layernorm, and multi-head attention, specifically targeting padding-free execution for efficiency.
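The padding-free idea above can be illustrated with a small sketch. This is a conceptual demo, not the ByteTransformer API: instead of padding every sequence in a batch to the maximum length, valid tokens are packed into one contiguous buffer with per-sequence offsets, so kernels never process padding tokens. All names here are illustrative.

```python
import numpy as np

# Three variable-length sequences (lengths 3, 5, 1).
seqs = [np.arange(1.0, 4.0), np.arange(1.0, 6.0), np.arange(1.0, 2.0)]

# Padded layout: batch x max_len, with a mask marking real tokens.
max_len = max(len(s) for s in seqs)
padded = np.zeros((len(seqs), max_len))
mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, s in enumerate(seqs):
    padded[i, : len(s)] = s
    mask[i, : len(s)] = True

# Packed (padding-free) layout: one flat buffer plus cumulative offsets,
# similar in spirit to the offset arrays used by padding-free kernels.
packed = np.concatenate(seqs)
offsets = np.cumsum([0] + [len(s) for s in seqs])  # [0, 3, 8, 9]

# Any per-token op now touches 9 elements instead of 3 * 5 = 15.
padded_out = np.where(mask, padded * 2.0, 0.0)
packed_out = packed * 2.0

# Unpack and check: identical results, with no work spent on padding.
for i in range(len(seqs)):
    assert np.allclose(padded_out[i, mask[i]],
                       packed_out[offsets[i]:offsets[i + 1]])
print(packed.size, padded.size)  # 9 15
```

The saving grows with length variance in the batch: the padded layout does work proportional to `batch * max_len`, while the packed layout does work proportional to the total number of real tokens.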
Quick Start & Requirements
Build with CMake and make:
git submodule update --init
mkdir build && cd build
cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" ..
make
A benchmark script, benchmark/bert_bench.sh, is provided.
Maintenance & Community
The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.
Licensing & Compatibility
The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README.
Limitations & Caveats
Currently, only the standard BERT transformer encoder architecture is supported within this repository.
Activity: last updated about 1 year ago; the repository appears inactive.