cuBERT by zhihu

High-performance BERT inference library

Created 6 years ago
546 stars

Top 58.6% on SourcePulse

Project Summary

zhihu/cuBERT offers a highly optimized inference engine for BERT models, aimed at users who need maximum performance on NVIDIA GPUs (CUDA/cuBLAS) and Intel CPUs (MKL). By bypassing the overhead of full frameworks such as TensorFlow, it delivers significantly lower latency and higher throughput for BERT applications, especially in production serving.

How It Works

cuBERT implements BERT inference directly using low-level CUDA and MKL libraries, avoiding the overhead of full deep learning frameworks. This custom, optimized approach provides substantial speedups. On NVIDIA GPUs, it leverages Tensor Cores on Volta/Turing architectures for mixed-precision computation, achieving over 2x acceleration with minimal accuracy loss. The engine supports standard BERT pooling and various output types.
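
As a rough illustration of that framework-free call path, here is a minimal Python sketch. The module, class, and method names (cubert, Model, compute) and the model filename are assumptions standing in for the repo's actual Python binding, not a confirmed API; check the repository for the real wrapper interface.

    # Hypothetical sketch of cuBERT's Python wrapper; the module, class,
    # and method names here are assumptions, not the confirmed API.
    import numpy as np
    import cubert  # assumed module name for the Python binding

    MAX_BATCH, SEQ_LEN = 128, 32

    # Load a frozen BERT graph once; cuBERT executes it directly on
    # CUDA/cuBLAS or MKL, with no TensorFlow session in the loop.
    model = cubert.Model("bert_frozen.pb",        # placeholder model path
                         max_batch_size=MAX_BATCH,
                         seq_length=SEQ_LEN,
                         num_hidden_layers=12,    # BERT-Base depth
                         num_attention_heads=12)

    # Standard BERT inputs: token ids, attention mask, segment ids.
    input_ids = np.zeros((MAX_BATCH, SEQ_LEN), dtype=np.int32)
    input_mask = np.ones((MAX_BATCH, SEQ_LEN), dtype=np.int32)
    segment_ids = np.zeros((MAX_BATCH, SEQ_LEN), dtype=np.int32)

    logits = model.compute(input_ids, input_mask, segment_ids)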

Quick Start & Requirements

  • Installation: Build from source via CMake (e.g., -DcuBERT_ENABLE_GPU=ON), then make install. Python and Java wrappers are available. Pre-built Python wheels for MKL on Linux can be installed via pip.
  • Prerequisites: CUDA Toolkit (e.g., v9.0), Intel MKL, and protobuf-c; the TensorFlow C API is needed for the benchmarks.
  • Setup: No specific time estimates or links to docs/demos are provided.

Highlighted Details

  • Performance: Benchmarks show cuBERT significantly outperforming TensorFlow on both GPU (e.g., 184.6 ms vs 255.2 ms at batch size 128) and CPU (e.g., 984.9 ms vs 1504.0 ms at batch size 128).
  • Mixed Precision: Achieves >2x speedup on NVIDIA Volta/Turing GPUs using Tensor Cores (fp16 storage, fp32 compute) with <1% accuracy error.
  • API: Supports standard BERT pooling (first-token hidden state) and average pooling; output types include logits, probabilities, and the pooled, sequence, and embedding outputs (see the sketch after this list).
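
To make the two pooling modes concrete, the following NumPy sketch reproduces the math they compute over a sequence_output tensor; it mirrors standard BERT semantics and is not cuBERT's API:

    import numpy as np

    B, S, H = 4, 32, 768  # batch, sequence length, hidden size (BERT-Base)
    sequence_output = np.random.randn(B, S, H).astype(np.float32)
    input_mask = np.ones((B, S), dtype=np.float32)  # 1 = real token, 0 = pad

    # Standard BERT pooling: the final hidden state of the first ([CLS])
    # token; in full BERT this then passes through a dense + tanh pooler.
    standard_pooled = sequence_output[:, 0, :]                    # (B, H)

    # Average pooling: mask-weighted mean over the token dimension, so
    # padding positions do not dilute the result.
    masked = sequence_output * input_mask[:, :, None]
    average_pooled = masked.sum(axis=1) / input_mask.sum(axis=1,
                                                         keepdims=True)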

Maintenance & Community

  • Authors: fanliwen, wangruixin, fangkuan, sunxian.
  • Community/Support: No community channels or roadmap information are mentioned.

Licensing & Compatibility

  • License: No open-source license is stated in the README, a critical omission when evaluating the project for adoption.
  • Compatibility: Primarily tied to specific CUDA and MKL versions. No notes on commercial use or linking.

Limitations & Caveats

  • Scope: Exclusively supports BERT (Transformer) models.
  • CPU Optimization: Optimal CPU performance requires careful tuning of environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS, CUBERT_NUM_CPU_MODELS) to balance the competing levels of parallelism; see the sketch after this list.
  • Dependencies: Relies on specific CUDA and MKL versions, potentially limiting compatibility.
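
One plausible layout for those variables on a 16-core host is sketched below: four independent model instances with four MKL threads each. The 4 x 4 split is an illustrative assumption, not a documented recommendation; the right balance depends on your hardware and request load.

    import os

    # Must be set before cuBERT and the MKL/OpenMP runtimes initialize.
    os.environ["CUBERT_NUM_CPU_MODELS"] = "4"  # parallel model instances
    os.environ["MKL_NUM_THREADS"] = "4"        # MKL threads per instance
    os.environ["OMP_NUM_THREADS"] = "4"        # OpenMP thread pool size

    # import cubert  # (assumed module name) import only after setting these
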
Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

  • BIG-bench by google: Collaborative benchmark for probing and extrapolating LLM capabilities. 3k stars, top 0.1% on SourcePulse; created 4 years ago, updated 1 year ago.
  • text-to-text-transfer-transformer by google-research: Unified text-to-text transformer for NLP research. 6k stars, top 0.1% on SourcePulse; created 6 years ago, updated 5 months ago.