bert.cpp by skeskinen

C++ library for local BERT inference

created 2 years ago
495 stars

Top 63.4% on sourcepulse

Project Summary

This project provides a pure C++ implementation of BERT inference for generating high-quality sentence embeddings, aimed at developers who need efficient CPU-based NLP. It leverages ggml for 4-bit integer quantization, shrinking models such as all-MiniLM-L6-v2 to roughly 14 MB with low RAM usage, and offers a dependency-free alternative for embedding generation.

How It Works

The core of bert.cpp is its ggml-based inference engine, designed for efficient execution on CPUs across architectures (x86, ARM). It supports 4-bit quantization of model weights, significantly reducing model size and memory footprint. The implementation performs mean pooling and normalization to mirror SentenceTransformer behavior, with the caveat that it does not strictly follow every tokenizer, pooling, or normalization setting specified in model cards.

Quick Start & Requirements

  • Install: Clone the repository and initialize submodules: git submodule update --init --recursive.
  • Models: Use pip3 install -r requirements.txt and python3 models/download-ggml.py to download models.
  • Build: Compile with CMake: mkdir build && cd build && cmake .. -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release && make.
  • Dependencies: Python 3, CMake, C++ compiler.
  • Resources: A 4-bit quantized MiniLM-L6-v2 model is ~14MB.
  • Docs: llama.cpp integration

Highlighted Details

  • Offers 4-bit quantization (q4_0, q4_1) with minimal accuracy loss compared to f32/f16.
  • Benchmarks show comparable or better evaluation times than sbert with batch_size=1 on CPU.
  • Supports conversion of Hugging Face models to ggml format using provided scripts.
  • Can run non-SentenceTransformer BERT models, though accuracy may vary.

Maintenance & Community

The project has been integrated into the more actively maintained llama.cpp project. Direct community links for bert.cpp are not prominently featured, but discussions can be found within the llama.cpp repository.

Licensing & Compatibility

The project appears to be under a permissive license, likely MIT, given its association with ggml and llama.cpp, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The tokenizer has known issues with Asian writing systems (CJK). Batching is a work-in-progress, impacting performance for multi-sentence inputs. The project does not fully respect all model-specific configuration parameters like pooling or normalization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
