ByteTransformer by bytedance

High-performance BERT transformer inference on NVIDIA GPUs

Created 2 years ago
479 stars

Top 63.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

ByteTransformer is a high-performance inference library for BERT-like transformers, targeting developers who want to optimize inference serving on NVIDIA GPUs. It offers architecture-aware optimizations for padding-free BERT routines, delivering better performance and lower latency than standard implementations.

How It Works

This library provides both Python and C++ APIs, featuring a PyTorch plugin for easy integration. It implements end-to-end optimizations across key BERT components like QKV encoding, softmax, feed-forward networks, activation, layernorm, and multi-head attention, specifically targeting padding-free execution for efficiency.
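
The core of these optimizations is "padding-free" execution: real tokens from a variable-length batch are packed into one dense tensor so no compute is wasted on padded positions. The sketch below illustrates the idea in plain PyTorch; the function name, shapes, and offset format are illustrative assumptions, not ByteTransformer's actual API.

```python
# Illustrative sketch of the "padding-free" idea in plain PyTorch.
# The function name, shapes, and offset format are assumptions for
# explanation only -- they are not ByteTransformer's actual API.
import torch

def remove_padding(hidden_states, attention_mask):
    # hidden_states: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]
    # with 1 for real tokens and 0 for padding.
    seq_lens = attention_mask.sum(dim=1)                  # tokens per sequence
    cu_seqlens = torch.cat([seq_lens.new_zeros(1),
                            seq_lens.cumsum(dim=0)])      # prefix sums, [batch + 1]
    packed = hidden_states[attention_mask.bool()]         # [total_tokens, hidden]
    return packed, cu_seqlens

# Batch of 3 sequences with lengths 5, 2, 7 padded to length 8, hidden size 16.
mask = torch.zeros(3, 8, dtype=torch.int64)
for i, n in enumerate([5, 2, 7]):
    mask[i, :n] = 1
x = torch.randn(3, 8, 16)
packed, cu_seqlens = remove_padding(x, mask)
print(packed.shape)   # torch.Size([14, 16]) -- padding tokens are gone
print(cu_seqlens)     # tensor([ 0,  5,  7, 14])
```

Downstream kernels then operate only on the packed tokens, using the per-sequence offsets to recover sequence boundaries.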

Quick Start & Requirements

  • Installation: Build from source with CMake and make: run git submodule update --init, then mkdir build && cd build, configure with cmake -DTORCH_CUDA_ARCH_LIST="8.0" -DDataType=FP16 -DBUILD_THS=ON -DCUDAARCHS="80" .., and run make. A sketch of loading the resulting PyTorch plugin follows this list.
  • Prerequisites: CUDA 11.6, CMake >= 3.13, PyTorch >= 1.8, Python >= 3.7, and a GPU with compute capability 7.0 (V100), 7.5 (T4), or 8.0 (A100).
  • Resources: Setup requires compiling from source; a benchmark script is provided at benchmark/bert_bench.sh.
  • Docs/Demo: Technical details published at IEEE IPDPS 2023 and arXiv:2210.03052.
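
With -DBUILD_THS=ON, the build produces a TorchScript plugin (a shared library) that can be loaded into a Python process through PyTorch's standard custom-op mechanism. The sketch below shows only that loading pattern; the library path and any operator names it registers are assumptions, so check the repository's benchmark and unit-test scripts for the real usage.

```python
# Minimal sketch of loading a TorchScript custom-op library such as the one
# produced by -DBUILD_THS=ON. The .so path is a hypothetical example; the real
# file name and registered op names come from the ByteTransformer build.
import torch

torch.ops.load_library("build/lib/libths_bytetransformer.so")  # hypothetical path

# Ops and classes registered by the plugin now appear under torch.ops /
# torch.classes and can be called from eager code or TorchScript models.
print(torch.ops.loaded_libraries)
```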

Highlighted Details

  • Achieves superior inference performance compared to PyTorch, TensorFlow, FasterTransformer, and DeepSpeed on A100 GPUs.
  • Demonstrates significant speedups, e.g., reducing latency for BERT inference with batch size 16 and sequence length 1024 from 53.21 ms (PyTorch) to 24.70 ms (ByteTransformer); see the quick calculation after this list.
  • Optimizes padding-free BERT routines, including QKV, softmax, FFN, activation, layernorm, and multi-head attention.
  • Supports both fixed-length and variable-length transformer inputs.
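
For reference, the quoted benchmark figures imply the following speedup (simple arithmetic on the numbers above):

```python
# Speedup implied by the reported A100 numbers (batch 16, sequence length 1024).
pytorch_ms = 53.21
bytetransformer_ms = 24.70
print(f"{pytorch_ms / bytetransformer_ms:.2f}x faster")  # 2.15x faster
```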

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The license type and any compatibility notes for commercial or closed-source use are not specified in the provided README.

Limitations & Caveats

Currently, only the standard BERT transformer encoder architecture is supported within this repository.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

0.1%
3k
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago
Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

0.1%
6k
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 6 months ago