tokenizers by huggingface

Fast tokenizer library optimized for research and production

created 5 years ago
9,948 stars

Top 5.1% on sourcepulse

View on GitHub
Project Summary

This library provides highly optimized tokenizers for natural language processing, covering both research and production use. It is aimed at developers and researchers working with large text datasets who need fast, versatile text pre-processing pipelines.

How It Works

The core of the library is implemented in Rust, ensuring exceptional performance for both training new vocabularies and tokenizing text. It supports popular algorithms like Byte-Pair Encoding (BPE), WordPiece, and Unigram. A key advantage is its ability to track alignments between original text and tokens after normalization, enabling precise mapping back to the source.
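
As a minimal sketch of that alignment tracking through the library's Python bindings (this assumes network access to fetch the `bert-base-uncased` tokenizer from the Hugging Face Hub; any pretrained tokenizer would do):

```python
from tokenizers import Tokenizer

# Load a pretrained tokenizer from the Hugging Face Hub
# (downloaded on first use).
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Normalization lowercases the text and strips the accent, yet the
# returned offsets still index into the original input string.
encoding = tokenizer.encode("Héllo world", add_special_tokens=False)
print(encoding.tokens)   # expected: ['hello', 'world']
print(encoding.offsets)  # expected: [(0, 5), (6, 11)]
```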

Quick Start & Requirements
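
The Python bindings are published on PyPI (`pip install tokenizers`); Rust and Node.js packages are distributed separately. Below is a minimal sketch of training a BPE tokenizer from scratch, following the library's documented Python API; `corpus.txt` is a placeholder for your own training data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn a vocabulary from raw text files ("corpus.txt" is a placeholder).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Persist the trained tokenizer and use it right away.
tokenizer.save("tokenizer.json")
output = tokenizer.encode("Hello, y'all!")
print(output.tokens)
```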

Highlighted Details

  • Implemented in Rust for high performance, capable of tokenizing 1GB of text in under 20 seconds on a server CPU.
  • Supports BPE, WordPiece, and Unigram models.
  • Offers comprehensive pre-processing: truncation, padding, and special token insertion (see the sketch after this list).
  • Provides normalization with alignment tracking for mapping tokens back to original text.
  • Bindings available for Rust, Python, and Node.js, with a community-contributed Ruby binding.
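
A sketch of the truncation and padding hooks mentioned above (the choice of `bert-base-uncased` is illustrative; `enable_truncation` and `enable_padding` are part of the Python API):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Clamp every encoding to 8 tokens and pad shorter ones up to that length.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_token="[PAD]", length=8)

batch = tokenizer.encode_batch([
    "short text",
    "a much longer sentence that will be cut off at the limit",
])
for enc in batch:
    print(enc.tokens)  # every sequence comes back exactly 8 tokens long
```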

Maintenance & Community

  • Developed and maintained by Hugging Face.
  • Active community support and development.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided performance benchmarks are specific to a particular AWS instance and may vary across different hardware configurations.

Health Check

Last commit: 4 days ago
Responsiveness: 1 day
Pull Requests (30d): 9
Issues (30d): 17

Star History

334 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

tokenmonster by alasdairforsythe

Top 0.7% on sourcepulse · 594 stars
Subword tokenizer and vocabulary trainer for multiple languages
created 2 years ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Top 0.2% on sourcepulse · 10k stars
Minimal BPE encoder/decoder for LLM tokenization
created 1 year ago · updated 1 year ago