Fast tokenizer library optimized for research and production
This library provides highly optimized tokenizers for natural language processing, built for both research and production use. It targets developers and researchers working with large text datasets who need fast, flexible pre-processing pipelines.
How It Works
The core of the library is implemented in Rust, ensuring exceptional performance for both training new vocabularies and tokenizing text. It supports popular algorithms like Byte-Pair Encoding (BPE), WordPiece, and Unigram. A key advantage is its ability to track alignments between original text and tokens after normalization, enabling precise mapping back to the source.
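As a brief sketch using the Python bindings, the snippet below trains a small BPE vocabulary and then inspects the character offsets that map each token back into the original text (corpus.txt is a placeholder path for your own training data):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer and train it on a local text file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder

# Encode a sentence; offsets are (start, end) character spans in the source string,
# preserved even after normalization
output = tokenizer.encode("Hello, y'all!")
print(output.tokens)
print(output.offsets)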
Quick Start & Requirements
pip install tokenizers

Or, to install the Python bindings from source:

pip install git+https://github.com/huggingface/tokenizers.git#subdirectory=bindings/python
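Once installed, a minimal usage sketch (bert-base-uncased is used here as an example model id fetched from the Hugging Face Hub):

from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hugging Face Hub (example model id)
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)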
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Released under the Apache-2.0 license.
Limitations & Caveats
Published performance benchmarks were measured on a specific AWS instance type; results may differ on other hardware.