tokenizers by huggingface

Fast tokenizer library optimized for research and production

created 5 years ago
9,948 stars

Top 5.1% on sourcepulse

View on GitHub
Project Summary

This library provides highly optimized tokenizers for natural language processing, covering both research and production use. It is aimed at developers and researchers working with large text datasets who need fast, versatile text pre-processing pipelines.

How It Works

The core of the library is implemented in Rust, ensuring exceptional performance for both training new vocabularies and tokenizing text. It supports popular algorithms like Byte-Pair Encoding (BPE), WordPiece, and Unigram. A key advantage is its ability to track alignments between original text and tokens after normalization, enabling precise mapping back to the source.
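
As a minimal sketch of that alignment tracking through the library's Python bindings (this assumes network access to fetch the `bert-base-uncased` tokenizer from the Hugging Face Hub; any pretrained tokenizer would do):

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hugging Face Hub
# (downloaded on first use).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Normalization lowercases the text and strips the accent, yet the
# returned offsets still index into the original input string.
encoding = tokenizer.encode("Héllo world", add_special_tokens=False)
print(encoding.tokens)   # expected: ['hello', 'world']
print(encoding.offsets)  # expected: [(0, 5), (6, 11)]
```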

Quick Start & Requirements
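
The Python bindings are published on PyPI (`pip install tokenizers`); Rust and Node.js packages are distributed separately. Below is a minimal sketch of training a BPE tokenizer from scratch, following the library's documented Python API; `corpus.txt` is a placeholder for your own training data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn a vocabulary from raw text files ("corpus.txt" is a placeholder).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Persist the trained tokenizer and use it right away.
tokenizer.save("tokenizer.json")
output = tokenizer.encode("Hello, y'all!")
print(output.tokens)
```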

Highlighted Details

  • Implemented in Rust for high performance, capable of tokenizing 1GB of text in under 20 seconds on a server CPU.
  • Supports BPE, WordPiece, and Unigram models.
  • Offers comprehensive pre-processing: truncation, padding, and special token insertion (see the sketch after this list).
  • Provides normalization with alignment tracking for mapping tokens back to original text.
  • Bindings available for Rust, Python, and Node.js, with a community-contributed Ruby binding.
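
A sketch of the truncation and padding hooks mentioned above (the choice of `bert-base-uncased` is illustrative; `enable_truncation` and `enable_padding` are part of the Python API):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Clamp every encoding to 8 tokens and pad shorter ones up to that length.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_token="[PAD]", length=8)

batch = tokenizer.encode_batch([
    "short text",
    "a much longer sentence that will be cut off at the limit",
])
for enc in batch:
    print(enc.tokens)  # every sequence comes back exactly 8 tokens long
```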

Maintenance & Community

  • Developed and maintained by Hugging Face.
  • Active community support and development.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided performance benchmarks are specific to a particular AWS instance and may vary across different hardware configurations.

Health Check

Last commit: 4 days ago
Responsiveness: 1 day
Pull Requests (30d): 9
Issues (30d): 17

Star History

334 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

tokenmonster by alasdairforsythe

Top 0.7% on sourcepulse · 594 stars
Subword tokenizer and vocabulary trainer for multiple languages
created 2 years ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Top 0.2% on sourcepulse · 10k stars
Minimal BPE encoder/decoder for LLM tokenization
created 1 year ago · updated 1 year ago