flash-tokenizer by NLPOptimize

CPU tokenizer library for LLM inference serving

created 4 months ago
464 stars

Top 66.3% on sourcepulse

View on GitHub: https://github.com/NLPOptimize/flash-tokenizer
Project Summary

FlashTokenizer is a high-performance C++ implementation of the BERT tokenizer, designed to accelerate LLM inference serving. It targets researchers and engineers seeking faster, more accurate tokenization than existing solutions such as Hugging Face's BertTokenizerFast, and claims speedups of up to 10x over that baseline while maintaining accuracy.

How It Works

FlashTokenizer is built in C++17 and uses the LinMax Tokenizer algorithm for linear-time tokenization. Vocabulary matching is backed by an Aho-Corasick (AC) trie, and batch encoding is parallelized at the C++ level with OpenMP. Further optimizations include memory-footprint reduction, branch pipelining, and Bloom filters for fast classification of punctuation, control, and whitespace characters.
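
To make the matching step concrete, the sketch below shows greedy longest-match WordPiece over a toy vocabulary in plain Python. It is illustrative only, not FlashTokenizer's code: the real implementation replaces the repeated string slicing here with a single left-to-right trie pass.

    # Illustrative sketch (not FlashTokenizer's actual code): greedy
    # longest-match WordPiece over a toy vocabulary. An Aho-Corasick
    # trie lets the real implementation find the longest matching piece
    # in one pass instead of re-slicing the string as done here.
    VOCAB = {"un", "##aff", "##able", "[UNK]"}

    def wordpiece(word, vocab=VOCAB):
        pieces, start = [], 0
        while start < len(word):
            end, match = len(word), None
            while start < end:  # shrink the span until a vocab piece matches
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:  # no piece matched: the whole word is unknown
                return ["[UNK]"]
            pieces.append(match)
            start = end
        return pieces

    print(wordpiece("unaffable"))  # -> ['un', '##aff', '##able']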

Quick Start & Requirements

  • Install: pip install -U flash-tokenizer
  • Prerequisites: Python 3.8-3.13. On Windows, vc_redist.x64.exe is required. Building from source requires a C++17 compiler (g++, clang++, or MSVC).
  • Setup: Installation via pip is straightforward. Building from source involves cloning the repository and running pip install . from the prj directory. A hedged usage sketch follows this list.
  • Documentation: https://github.com/NLPOptimize/flash-tokenizer
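
A minimal usage sketch appears below. The import path, the BertTokenizerFlash class name, and the loader and encode calls are assumptions modeled on the Hugging Face tokenizer API, not a confirmed FlashTokenizer interface; consult the README for the exact API.

    # Hedged sketch: the import path, class name, and methods below are
    # assumptions modeled on the Hugging Face tokenizer API, not a
    # confirmed FlashTokenizer interface; check the README before use.
    from flash_tokenizer import BertTokenizerFlash  # assumed name

    tokenizer = BertTokenizerFlash.from_pretrained("bert-base-uncased")  # assumed loader
    ids = tokenizer.encode("FlashTokenizer is fast.")  # assumed method
    print(ids)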

Highlighted Details

  • Claims up to 10x speed improvement over transformers.BertTokenizerFast.
  • Achieves 35K texts/sec on bert-base-uncased on a single CPU core (see the timing sketch after this list).
  • Supports multiple languages including Chinese, Korean, and Japanese.
  • Implemented in C++ for performance and ease of maintenance.
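
To put the texts/sec figure in context, here is a hedged sketch of a throughput measurement. The harness is plain Python and takes any tokenizer callable, so nothing in it depends on FlashTokenizer's actual API; a trivial whitespace baseline stands in for the real tokenizer.

    # Throughput-measurement sketch: plain Python timing harness; swap
    # the baseline callable for a real tokenizer to benchmark it.
    import time

    def measure_throughput(tokenize, texts):
        """Return texts tokenized per second for a callable `tokenize`."""
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        elapsed = time.perf_counter() - start
        return len(texts) / elapsed

    # Trivial whitespace baseline shown; substitute the real tokenizer.
    texts = ["FlashTokenizer is fast."] * 10_000
    print(f"{measure_throughput(str.split, texts):,.0f} texts/sec")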

Maintenance & Community

The project has been actively developed, with frequent updates noted in the README, including performance benchmarking and accuracy improvements. The README does not link to community resources.

Licensing & Compatibility

The README does not explicitly state a license. It does detail compatibility with deep learning frameworks, recommending CUDA 11.8 for both PyTorch and ONNX Runtime to maximize GPU support.

Limitations & Caveats

The README lists the BidirectionalWordPieceTokenizer implementation as a TODO item. While the project claims high accuracy, its own benchmarks show a slight accuracy decrease relative to some other tokenizers on certain models. Pre-built wheel packages are not currently provided, so pip installs build from source.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 227 stars in the last 90 days
