cuBERT by zhihu

High-performance BERT inference library

Created 6 years ago
546 stars

Top 58.6% on SourcePulse

Project Summary

zhihu/cuBERT offers a highly optimized inference engine for BERT models, aimed at users who need maximum performance on NVIDIA GPUs (CUDA/cuBLAS) and Intel CPUs (MKL). By bypassing the overhead of full frameworks such as TensorFlow, it delivers significantly lower latency and higher throughput for BERT applications, especially in production serving.

How It Works

cuBERT implements BERT inference directly using low-level CUDA and MKL libraries, avoiding the overhead of full deep learning frameworks. This custom, optimized approach provides substantial speedups. On NVIDIA GPUs, it leverages Tensor Cores on Volta/Turing architectures for mixed-precision computation, achieving over 2x acceleration with minimal accuracy loss. The engine supports standard BERT pooling and various output types.
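
As a rough illustration of that framework-free call path, here is a minimal Python sketch. The module, class, and method names (cubert, Model, compute) and the model filename are assumptions standing in for the repo's actual Python binding, not a confirmed API; check the repository for the real wrapper interface.

    # Hypothetical sketch of cuBERT's Python wrapper; the module, class,
    # and method names here are assumptions, not the confirmed API.
    import numpy as np
    import cubert  # assumed module name for the Python binding

    MAX_BATCH, SEQ_LEN = 128, 32

    # Load a frozen BERT graph once; cuBERT executes it directly on
    # CUDA/cuBLAS or MKL, with no TensorFlow session in the loop.
    model = cubert.Model("bert_frozen.pb",        # placeholder model path
                         max_batch_size=MAX_BATCH,
                         seq_length=SEQ_LEN,
                         num_hidden_layers=12,    # BERT-Base depth
                         num_attention_heads=12)

    # Standard BERT inputs: token ids, attention mask, segment ids.
    input_ids = np.zeros((MAX_BATCH, SEQ_LEN), dtype=np.int32)
    input_mask = np.ones((MAX_BATCH, SEQ_LEN), dtype=np.int32)
    segment_ids = np.zeros((MAX_BATCH, SEQ_LEN), dtype=np.int32)

    logits = model.compute(input_ids, input_mask, segment_ids)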

Quick Start & Requirements

  • Installation: Build from source via CMake (e.g., -DcuBERT_ENABLE_GPU=ON), then make install. Python and Java wrappers are available. Pre-built Python wheels for MKL on Linux can be installed via pip.
  • Prerequisites: CUDA Toolkit (e.g., v9.0), Intel MKL, and protobuf-c; the TensorFlow C API is needed for the benchmarks.
  • Setup: No specific time estimates or links to docs/demos are provided.

Highlighted Details

  • Performance: Benchmarks show cuBERT significantly outperforming TensorFlow on both GPU (e.g., 184.6 ms vs 255.2 ms at batch size 128) and CPU (e.g., 984.9 ms vs 1504.0 ms at batch size 128).
  • Mixed Precision: Achieves >2x speedup on NVIDIA Volta/Turing GPUs using Tensor Cores (fp16 storage, fp32 compute) with <1% accuracy error.
  • API: Supports standard BERT pooling (first-token hidden state) and average pooling; output types include logits, probabilities, and the pooled, sequence, and embedding outputs (see the sketch after this list).
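
To make the two pooling modes concrete, the following NumPy sketch reproduces the math they compute over a sequence_output tensor; it mirrors standard BERT semantics and is not cuBERT's API:

    import numpy as np

    B, S, H = 4, 32, 768  # batch, sequence length, hidden size (BERT-Base)
    sequence_output = np.random.randn(B, S, H).astype(np.float32)
    input_mask = np.ones((B, S), dtype=np.float32)  # 1 = real token, 0 = pad

    # Standard BERT pooling: the final hidden state of the first ([CLS])
    # token; in full BERT this then passes through a dense + tanh pooler.
    standard_pooled = sequence_output[:, 0, :]                    # (B, H)

    # Average pooling: mask-weighted mean over the token dimension, so
    # padding positions do not dilute the result.
    masked = sequence_output * input_mask[:, :, None]
    average_pooled = masked.sum(axis=1) / input_mask.sum(axis=1,
                                                         keepdims=True)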

Maintenance & Community

  • Authors: fanliwen, wangruixin, fangkuan, sunxian.
  • Community/Support: No community channels or roadmap information are mentioned.

Licensing & Compatibility

  • License: No open-source license is stated in the README, a critical omission when evaluating the project for adoption.
  • Compatibility: Primarily tied to specific CUDA and MKL versions. No notes on commercial use or linking.

Limitations & Caveats

  • Scope: Exclusively supports BERT (Transformer) models.
  • CPU Optimization: Optimal CPU performance requires careful tuning of environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS, CUBERT_NUM_CPU_MODELS) to balance the competing levels of parallelism; see the sketch after this list.
  • Dependencies: Relies on specific CUDA and MKL versions, potentially limiting compatibility.
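
One plausible layout for those variables on a 16-core host is sketched below: four independent model instances with four MKL threads each. The 4 x 4 split is an illustrative assumption, not a documented recommendation; the right balance depends on your hardware and request load.

    import os

    # Must be set before cuBERT and the MKL/OpenMP runtimes initialize.
    os.environ["CUBERT_NUM_CPU_MODELS"] = "4"  # parallel model instances
    os.environ["MKL_NUM_THREADS"] = "4"        # MKL threads per instance
    os.environ["OMP_NUM_THREADS"] = "4"        # OpenMP thread pool size

    # import cubert  # (assumed module name) import only after setting these
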
Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

  • BIG-bench by google: Collaborative benchmark for probing and extrapolating LLM capabilities. 3k stars, top 0.1% on SourcePulse; created 4 years ago, updated 1 year ago.
  • text-to-text-transfer-transformer by google-research: Unified text-to-text transformer for NLP research. 6k stars, top 0.1% on SourcePulse; created 6 years ago, updated 5 months ago.