flash-tokenizer by NLPOptimize

CPU tokenizer library for LLM inference serving

Created 6 months ago
458 stars

Top 66.1% on SourcePulse

View on GitHub
Project Summary

FlashTokenizer is a high-performance C++ implementation of the BERT tokenizer, designed to accelerate LLM inference serving. It targets researchers and engineers who need faster and more accurate tokenization than existing solutions such as Hugging Face's BertTokenizerFast, claiming speedups of up to 10x over BertTokenizerFast while maintaining accuracy.

How It Works

FlashTokenizer is written in C++17 and uses the LinMax Tokenizer algorithm for linear-time tokenization. It relies on an Aho-Corasick (AC) trie data structure and parallelizes batch encoding at the C++ level with OpenMP. Further optimizations include memory-reduction techniques, branch pipelining, and Bloom filters for fast punctuation, control-character, and whitespace checks.
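
As a rough illustration of the longest-match idea behind linear-time WordPiece tokenization, here is a small Python sketch over plain character tries with a toy vocabulary. It is not the library's implementation: FlashTokenizer does this in C++ with an AC trie, Bloom filters, and OpenMP batching, and the vocabulary pieces below are invented for the example.

```python
# Toy sketch of greedy longest-match WordPiece over character tries.
# Illustrates the longest-match idea only; not the actual C++ implementation.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_token = False

def build_trie(pieces):
    root = TrieNode()
    for piece in pieces:
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True
    return root

def wordpiece(word, initial_trie, cont_trie, unk="[UNK]"):
    """Greedy longest-match segmentation of one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        trie = initial_trie if start == 0 else cont_trie
        node, longest = trie, -1
        for i in range(start, len(word)):     # walk the trie, remember longest hit
            node = node.children.get(word[i])
            if node is None:
                break
            if node.is_token:
                longest = i
        if longest < 0:
            return [unk]                      # no piece matches: unknown word
        text = word[start:longest + 1]
        pieces.append(text if start == 0 else "##" + text)
        start = longest + 1
    return pieces

if __name__ == "__main__":
    vocab = ["un", "##aff", "##able", "runn", "##ing"]   # toy vocabulary
    initial = build_trie(p for p in vocab if not p.startswith("##"))
    cont = build_trie(p[2:] for p in vocab if p.startswith("##"))
    print(wordpiece("unaffable", initial, cont))   # ['un', '##aff', '##able']
    print(wordpiece("running", initial, cont))     # ['runn', '##ing']
```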

Quick Start & Requirements

  • Install: pip install -U flash-tokenizer
  • Prerequisites: Python 3.8-3.13. On Windows, vc_redist.x64.exe is required. Building from source requires a C++17 compiler (g++, clang++, or MSVC).
  • Setup: Installation via pip is straightforward; building from source involves cloning the repository and running pip install . from the prj directory. A hedged usage sketch follows this list.
  • Documentation: https://github.com/NLPOptimize/flash-tokenizer
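
The sketch below shows how the tokenizer might be used after installation. The flash_tokenizer names (BertTokenizerFlash, its vocab-file constructor, and encode()) are assumptions based on typical BERT-tokenizer APIs, not a confirmed interface; the Hugging Face calls are standard. Check the README for the exact API.

```python
# Hedged usage sketch. The flash_tokenizer names below are ASSUMPTIONS for
# illustration; consult the project README for the actual API.
from flash_tokenizer import BertTokenizerFlash   # assumed entry point
from transformers import BertTokenizerFast       # reference implementation

texts = [
    "FlashTokenizer is a fast C++ BERT tokenizer.",
    "Tokenization should not be the serving bottleneck.",
]

# Assumed: construct from the model's WordPiece vocab file (path is hypothetical).
flash = BertTokenizerFlash("vocab.txt", do_lower_case=True)
hf = BertTokenizerFast.from_pretrained("bert-base-uncased")

for text in texts:
    flash_ids = flash.encode(text, max_length=128)            # assumed method
    hf_ids = hf.encode(text, max_length=128, truncation=True)
    print("match:", flash_ids == hf_ids)
```

When comparing outputs, both tokenizers should load the same vocabulary (here, the one shipped with bert-base-uncased).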

Highlighted Details

  • Claims up to 10x speed improvement over transformers.BertTokenizerFast.
  • Achieves 35K texts/sec on bert-base-uncased on a single CPU core (a rough timing sketch follows this list).
  • Supports multiple languages including Chinese, Korean, and Japanese.
  • Implemented in C++ for performance and ease of maintenance.
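
A throughput figure like the one above is easy to sanity-check on your own hardware. The sketch below assumes only a callable that encodes a single text (for example, the tokenizer object from the earlier sketch); it is not part of the library, and results depend heavily on CPU, text length, and vocabulary.

```python
# Rough single-core throughput estimate; `tokenize_one` is any callable that
# encodes one text. Takes the best of several runs to reduce noise.
import time

def texts_per_second(tokenize_one, texts, repeats=5):
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize_one(text)
        best = min(best, time.perf_counter() - start)
    return len(texts) / best
```

Run it over tens of thousands of representative texts; very short or very long inputs will skew the texts/sec figure.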

Maintenance & Community

The project is actively developed with frequent updates noted in the README, including performance benchmarking and accuracy improvements. Links to community resources are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. It details compatibility with deep-learning frameworks, recommending CUDA 11.8 across PyTorch and ONNX Runtime for the broadest GPU support.

Limitations & Caveats

The README lists the BidirectionalWordPieceTokenizer implementation as a TODO item. While the project claims high accuracy, its benchmarks show slightly lower accuracy than some other tokenizers on certain models. Pre-built wheel packages are not provided, so installation may require compiling from source.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 28 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab
0.2% · 1k stars
Parallel decoding algorithm for faster LLM inference
Created 1 year ago · Updated 6 months ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler
0.3% · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 1 day ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA
0.1% · 6k stars
Optimized transformer library for inference
Created 4 years ago · Updated 1 year ago