C++ library for local BERT inference
This project provides a pure C++ implementation of BERT inference for generating high-quality sentence embeddings, aimed at developers who need efficient, CPU-only NLP. It leverages ggml for 4-bit integer quantization, shrinking models such as all-MiniLM-L6-v2 to as little as 14 MB with low RAM usage, and offers a dependency-free alternative for embedding generation.
How It Works
The core of bert.cpp is its ggml-based inference engine, designed for efficient CPU execution across architectures (x86, ARM). It supports 4-bit quantization of model weights, significantly reducing model size and memory footprint. The implementation applies mean pooling and normalization to mirror SentenceTransformer behavior, with the caveat that it does not strictly follow every tokenizer, pooling, or normalization setting specified in a model card.
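To make the pooling step concrete, here is a minimal, self-contained C++ sketch of mean pooling followed by L2 normalization over a token-embedding matrix. It illustrates the behavior described above; it is not the library's internal code.

```cpp
// Illustrative sketch only: the mean pooling + L2 normalization step
// that bert.cpp applies to token embeddings, mirroring the default
// SentenceTransformer pipeline.
#include <cmath>
#include <cstddef>
#include <vector>

// Reduce a row-major [n_tokens x n_embd] matrix of token embeddings
// to a single L2-normalized sentence embedding of length n_embd.
std::vector<float> mean_pool_normalize(const std::vector<float> & tokens,
                                       size_t n_tokens, size_t n_embd) {
    std::vector<float> out(n_embd, 0.0f);

    // Mean pooling: average each embedding dimension over all tokens.
    for (size_t t = 0; t < n_tokens; ++t)
        for (size_t d = 0; d < n_embd; ++d)
            out[d] += tokens[t * n_embd + d];
    for (float & v : out)
        v /= static_cast<float>(n_tokens);

    // L2 normalization: scale the vector to unit length so dot
    // products between embeddings behave as cosine similarities.
    float norm = 0.0f;
    for (float v : out)
        norm += v * v;
    norm = std::sqrt(norm);
    if (norm > 0.0f)
        for (float & v : out)
            v /= norm;

    return out;
}
```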
Quick Start & Requirements
From a clone of the repository, fetch the ggml submodule:

```sh
git submodule update --init --recursive
```

Install the Python requirements, then run the download script to fetch models:

```sh
pip3 install -r requirements.txt
python3 models/download-ggml.py
```

Build with CMake:

```sh
mkdir build && cd build
cmake .. -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
make
```
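Building with -DBUILD_SHARED_LIBS=ON yields a library you can link against directly. The sketch below shows hypothetical usage of the C API: it assumes bert.h declares bert_load_from_file, bert_n_embd, bert_encode, and bert_free, so verify the names and signatures against the header in your checkout, and treat the model path as a placeholder for whatever download-ggml.py fetched.

```cpp
// Hypothetical usage sketch. Assumes bert.h exposes bert_load_from_file,
// bert_n_embd, bert_encode, and bert_free; check the header in your
// checkout before relying on these names and signatures.
#include <cstdio>
#include <vector>

#include "bert.h"

int main() {
    // Example path; substitute the model downloaded by download-ggml.py.
    bert_ctx * ctx = bert_load_from_file("models/all-MiniLM-L6-v2/ggml-model-q4_0.bin");
    if (!ctx) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Allocate an output buffer sized to the model's embedding width.
    const int n_embd = bert_n_embd(ctx);
    std::vector<float> embedding(n_embd);

    // Tokenizes the input, runs the ggml graph, and writes the pooled,
    // normalized sentence embedding into the buffer.
    bert_encode(ctx, /*n_threads=*/4, "This is a test sentence.", embedding.data());

    printf("first dims: %f %f %f\n", embedding[0], embedding[1], embedding[2]);

    bert_free(ctx);
    return 0;
}
```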
Highlighted Details
Maintenance & Community
The project has been integrated into the more actively maintained llama.cpp project. Direct community links for bert.cpp are not prominently featured, but discussions can be found within the llama.cpp repository.
Licensing & Compatibility
The project appears to be under a permissive license, likely MIT, given its association with ggml and llama.cpp, allowing for commercial use and integration into closed-source projects.
Limitations & Caveats
The tokenizer has known issues with Asian writing systems (CJK). Batching is a work-in-progress, impacting performance for multi-sentence inputs. The project does not fully respect all model-specific configuration parameters like pooling or normalization.