High-performance BERT inference library
Top 58.6% on SourcePulse
Summary
zhihu/cuBERT offers a highly optimized inference engine for BERT models, targeting users who need maximum performance on NVIDIA GPUs (CUDA/cuBLAS) and Intel CPUs (MKL). By bypassing the overhead of full frameworks such as TensorFlow, it delivers significantly lower latency and higher throughput for BERT workloads, especially in production serving.
How It Works
cuBERT implements BERT inference directly using low-level CUDA and MKL libraries, avoiding the overhead of full deep learning frameworks. This custom, optimized approach provides substantial speedups. On NVIDIA GPUs, it leverages Tensor Cores on Volta/Turing architectures for mixed-precision computation, achieving over 2x acceleration with minimal accuracy loss. The engine supports standard BERT pooling and various output types.
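cuBERT's internal kernels are not reproduced here, but the mixed-precision technique the paragraph describes can be sketched with a plain cuBLAS call: FP16 inputs with FP32 accumulation, and an algorithm selector that asks cuBLAS to use Tensor Core kernels on Volta/Turing. The function name and matrix shapes below are illustrative, not taken from the cuBERT source.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Illustrative FP16 GEMM with FP32 accumulation -- the kind of
// mixed-precision operation that Tensor Cores accelerate.
// Column-major convention: A is m x k, B is k x n, C is m x n.
void half_gemm(cublasHandle_t handle,
               const __half* A, const __half* B, float* C,
               int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // CUDA_R_16F inputs, CUDA_R_32F output and compute type;
    // CUBLAS_GEMM_DEFAULT_TENSOR_OP requests a Tensor Core kernel.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Accumulating in FP32 while storing activations in FP16 is what keeps the accuracy loss minimal at 2x-plus speedup: the rounding error of half precision affects only the operands, not the running sums.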
Quick Start & Requirements
Build from source with CMake, passing -DcuBERT_ENABLE_GPU=ON to enable the GPU backend, then run make install.
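The sketch below shows what a minimal C++ caller might look like. It is an assumption-laden sketch, not verified against the shipped header: the function names (cuBERT_open, cuBERT_compute, cuBERT_close), their parameter lists, the buffer element types, and the model file name are all placeholders to check against the project's cuBERT.h.

```cpp
#include <vector>
#include "cuBERT.h"  // hypothetical header name; verify against the install

int main() {
    const int max_batch_size = 128, batch_size = 2, seq_length = 32;

    // Open a frozen graph; path and hyperparameters are placeholders
    // for an actual BERT-Base checkpoint (12 layers, 12 heads).
    void* model = cuBERT_open("bert_frozen_seq32.pb",
                              max_batch_size, seq_length,
                              /*num_hidden_layers=*/12,
                              /*num_attention_heads=*/12);

    // Standard BERT inputs: token ids, attention mask, segment ids.
    std::vector<int>  input_ids(batch_size * seq_length, 0);
    std::vector<char> input_mask(batch_size * seq_length, 1);
    std::vector<char> segment_ids(batch_size * seq_length, 0);
    std::vector<float> output(batch_size);  // e.g. pooled output per example

    cuBERT_compute(model, batch_size,
                   input_ids.data(), input_mask.data(), segment_ids.data(),
                   output.data());

    cuBERT_close(model);
    return 0;
}
```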
Python and Java wrappers are available. Pre-built Python wheels for MKL on Linux can be installed via pip.
Highlighted Details
Maintenance & Community
Last updated roughly 4 years ago; the repository is inactive.
Licensing & Compatibility
Limitations & Caveats
CPU performance depends on tuning environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS, CUBERT_NUM_CPU_MODELS) for balancing parallelism levels; see the sketch below.
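One way to picture the balance these variables control: keep instances times intra-op threads near the physical core count. The numbers below are illustrative for a hypothetical 16-core host, and reading CUBERT_NUM_CPU_MODELS as a concurrent-instance count is an assumption to verify against the project docs.

```cpp
#include <cstdlib>

// Illustrative tuning for a hypothetical 16-core host: 4 concurrent CPU
// model instances, each limited to 4 intra-op threads, so that
// instances x threads roughly matches the physical core count.
// Must run before the first model is created, since MKL/OpenMP read
// these variables at initialization time.
void configure_cpu_parallelism() {
    setenv("CUBERT_NUM_CPU_MODELS", "4", /*overwrite=*/1);  // assumed semantics
    setenv("MKL_NUM_THREADS", "4", 1);
    setenv("OMP_NUM_THREADS", "4", 1);
}
```

Oversubscribing (e.g. 8 instances each with 8 MKL threads on 16 cores) typically hurts latency more than it helps throughput, which is why the three knobs need to be tuned together.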