CPU tokenizer library for LLM inference serving
FlashTokenizer is a high-performance C++ implementation of the BERT tokenizer, designed to accelerate LLM inference serving. It targets researchers and engineers who need faster and more accurate tokenization than existing solutions such as Hugging Face's BertTokenizerFast. The library claims speedups of up to 10x over BertTokenizerFast while maintaining accuracy.
How It Works
FlashTokenizer is built in C++17 and uses the LinMax Tokenizer algorithm for linear-time tokenization. It combines an Aho-Corasick (AC) trie for vocabulary matching with OpenMP for parallel processing at the C++ level, enabling efficient batch encoding. Further optimizations include memory-reduction techniques, branch pipelining, and Bloom filters for fast classification of punctuation, control, and whitespace characters.
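To make the matching idea concrete, here is a minimal Python sketch of greedy longest-prefix WordPiece segmentation over a character trie. It is illustrative only: the names (`TrieNode`, `build_trie`, `wordpiece`) are not FlashTokenizer's internals, the toy vocabulary omits real `##` continuation entries, and the actual C++ engine adds Aho-Corasick failure links to keep the scan linear-time, which this sketch does not.

```python
# Conceptual sketch of trie-based longest-match WordPiece segmentation.
# Not FlashTokenizer's actual implementation.

class TrieNode:
    __slots__ = ("children", "is_token")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_token = False

def build_trie(vocab):
    """Insert every vocabulary entry into a character trie."""
    root = TrieNode()
    for token in vocab:
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True
    return root

def wordpiece(word, root, unk="[UNK]"):
    """Greedy longest-prefix segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        node, end, last_match = root, start, -1
        # Walk the trie as far as the input allows, remembering
        # the longest vocabulary entry seen so far.
        while end < len(word) and word[end] in node.children:
            node = node.children[word[end]]
            end += 1
            if node.is_token:
                last_match = end
        if last_match == -1:
            return [unk]  # no piece matches: whole word is unknown
        pieces.append(("##" if start else "") + word[start:last_match])
        start = last_match
    return pieces

vocab = {"un", "aff", "affable", "able"}
trie = build_trie(vocab)
print(wordpiece("unaffable", trie))  # ['un', '##affable']
```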
Quick Start & Requirements
- Install from PyPI with `pip install -U flash-tokenizer`.
- On Windows, the Visual C++ redistributable (`vc_redist.x64.exe`) is required.
- Building from source requires a C++17 compiler (g++, clang++, or MSVC); run `pip install .` from the `prj` directory.
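A minimal usage sketch follows. The `BertTokenizerFlash` class and `encode` method match the project's README at the time of writing, but treat the exact argument names (e.g. `do_lower_case`) as assumptions to verify against the current docs.

```python
from flash_tokenizer import BertTokenizerFlash

# Assumes a BERT vocabulary file (e.g. bert-base-uncased's vocab.txt) is on disk.
# Class/argument names follow the project's README and may change between versions.
tokenizer = BertTokenizerFlash("vocab.txt", do_lower_case=True)
ids = tokenizer.encode("FlashTokenizer accelerates LLM inference serving.")
print(ids)  # list of input ids, typically with [CLS]/[SEP] added
```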
Highlighted Details
- Claims up to 10x faster tokenization than transformers.BertTokenizerFast.
- Benchmarked with bert-base-uncased on a single CPU core.
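As a rough way to reproduce the comparison, here is a hedged single-core micro-benchmark sketch against transformers.BertTokenizerFast. The flash_tokenizer names are the same assumptions as in the sketch above; authoritative numbers should come from the project's own benchmark harness.

```python
import time
from transformers import BertTokenizerFast
from flash_tokenizer import BertTokenizerFlash  # class name assumed from the README

texts = ["FlashTokenizer targets CPU-side inference serving."] * 10_000

hf = BertTokenizerFast.from_pretrained("bert-base-uncased")
ft = BertTokenizerFlash("vocab.txt", do_lower_case=True)  # same vocab as the HF model

t0 = time.perf_counter()
for s in texts:
    hf.encode(s)
t_hf = time.perf_counter() - t0

t0 = time.perf_counter()
for s in texts:
    ft.encode(s)
t_ft = time.perf_counter() - t0

print(f"BertTokenizerFast: {t_hf:.3f}s  FlashTokenizer: {t_ft:.3f}s")
```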
Maintenance & Community
The project is actively developed, with frequent updates noted in the README, including performance benchmarking and accuracy improvements. The README does not explicitly link to community resources.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility with deep learning frameworks is detailed, with CUDA 11.8 recommended across PyTorch and ONNX Runtime for the broadest GPU support.
Limitations & Caveats
The README notes that the BidirectionalWordPieceTokenizer implementation is still a TODO item. While the project claims high accuracy, its own benchmarks show a slight accuracy decrease relative to some other tokenizers on certain models. Pre-built wheel packages are not provided, so installation compiles from source.