rust-tokenizers by guillaume-be

Rust library for high-performance tokenization in modern language models

created 5 years ago
323 stars

Top 85.3% on sourcepulse

Project Summary

This library provides high-performance tokenizers for modern language models, including WordPiece, BPE, and Unigram (SentencePiece) algorithms. It targets researchers and developers working with state-of-the-art transformer architectures, offering efficient tokenization for models like BERT, GPT, RoBERTa, and more.
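
As a quick illustration, below is a minimal sketch of single-sentence encoding with the crate's BERT (WordPiece) tokenizer. The vocabulary path is a placeholder, and exact constructor and method signatures may differ between crate versions:

    use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

    fn main() {
        // Placeholder path: a WordPiece vocabulary file (e.g. bert-base-uncased's
        // vocab.txt), downloaded manually beforehand.
        let tokenizer = BertTokenizer::from_file("bert-base-uncased-vocab.txt", true, true)
            .expect("failed to load vocabulary");

        // Encode one sentence, truncating to at most 128 tokens.
        let encoded = tokenizer.encode(
            "This is a sample sentence to be tokenized",
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        println!("{:?}", encoded.token_ids);
    }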

How It Works

The library is written in Rust for performance and implements several tokenization strategies. WordPiece tokenizers support both single-threaded and multi-threaded processing, while BPE tokenizers rely on a shared cache and are single-threaded only. This design aims to deliver faster tokenization than pure Python implementations.
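
To sketch what multi-threaded WordPiece processing looks like in practice, the snippet below batch-encodes several sentences through the MultiThreadedTokenizer trait. The vocabulary path is a placeholder, and the trait's signatures vary across crate versions:

    use rust_tokenizers::tokenizer::{BertTokenizer, MultiThreadedTokenizer, TruncationStrategy};

    fn main() {
        // Placeholder vocabulary path, downloaded manually beforehand.
        let tokenizer = BertTokenizer::from_file("bert-base-uncased-vocab.txt", true, true)
            .expect("failed to load vocabulary");

        // WordPiece tokenizers implement MultiThreadedTokenizer, so the batch
        // below can be encoded in parallel across worker threads.
        let batch = ["First sentence.", "Second sentence.", "Third sentence."];
        let outputs = MultiThreadedTokenizer::encode_list(
            &tokenizer,
            &batch,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        for encoded in outputs {
            println!("{:?}", encoded.token_ids);
        }
    }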

Quick Start & Requirements

  • Rust Usage: Requires Rust and a manual download of tokenizer vocabulary/merge files, such as those distributed with the Hugging Face Transformers library (a hedged sketch follows this list).
  • Python Usage: Requires the Rust nightly toolchain. Install by running python setup.py install from the /python-bindings directory.
  • Dependencies: The Python bindings require PyTorch and Hugging Face's transformers library.
  • Documentation: Usage examples are available in the /tests folder.
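
For a BPE model such as GPT-2, both a vocabulary file and a merges file are needed. A minimal sketch, assuming the files were downloaded manually and that the constructor matches the current crate version:

    use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer, TruncationStrategy};

    fn main() {
        // Placeholder paths: vocab.json and merges.txt fetched manually from
        // the gpt2 model repository on the Hugging Face hub.
        let tokenizer = Gpt2Tokenizer::from_file("gpt2-vocab.json", "gpt2-merges.txt", false)
            .expect("failed to load vocabulary/merges");

        let encoded = tokenizer.encode(
            "Byte-pair encoding at work",
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        println!("{:?}", encoded.token_ids);
    }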

Highlighted Details

  • Supports a wide range of transformer architectures including BERT, GPT, RoBERTa, DeBERTa, and more.
  • Offers WordPiece, BPE, and Unigram (SentencePiece) tokenization algorithms.
  • Includes Python bindings for easier integration with existing ML workflows.
  • Provides multi-threaded processing for WordPiece tokenizers.

Maintenance & Community

The project is maintained by its author, guillaume-be. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library requires manually downloading tokenizer vocabulary and merge files. The Python bindings require the Rust nightly toolchain, which may not be suitable for production environments. The README does not provide performance benchmarks or comparisons against other tokenization libraries.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

tokenmonster by alasdairforsythe

Subword tokenizer and vocabulary trainer for multiple languages
594 stars · 0.7% · created 2 years ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization
10k stars · 0.2% · created 1 year ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.