rust-tokenizers by guillaume-be

Rust library for high-performance tokenization in modern language models

created 5 years ago
323 stars

Top 85.3% on sourcepulse

Project Summary

This library provides high-performance tokenizers for modern language models, including WordPiece, BPE, and Unigram (SentencePiece) algorithms. It targets researchers and developers working with state-of-the-art transformer architectures, offering efficient tokenization for models like BERT, GPT, RoBERTa, and more.
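
As a quick illustration, below is a minimal sketch of single-sentence encoding with the crate's BERT (WordPiece) tokenizer. The vocabulary path is a placeholder, and exact constructor and method signatures may differ between crate versions:

    use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

    fn main() {
        // Placeholder path: a WordPiece vocabulary file (e.g. bert-base-uncased's
        // vocab.txt), downloaded manually beforehand.
        let tokenizer = BertTokenizer::from_file("bert-base-uncased-vocab.txt", true, true)
            .expect("failed to load vocabulary");

        // Encode one sentence, truncating to at most 128 tokens.
        let encoded = tokenizer.encode(
            "This is a sample sentence to be tokenized",
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        println!("{:?}", encoded.token_ids);
    }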

How It Works

The library is written in Rust for performance and implements several tokenization strategies. WordPiece tokenizers support both single-threaded and multi-threaded processing, while BPE tokenizers rely on a shared cache and are single-threaded only. This design aims to deliver faster tokenization than pure Python implementations.
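
To sketch what multi-threaded WordPiece processing looks like in practice, the snippet below batch-encodes several sentences through the MultiThreadedTokenizer trait. The vocabulary path is a placeholder, and the trait's signatures vary across crate versions:

    use rust_tokenizers::tokenizer::{BertTokenizer, MultiThreadedTokenizer, TruncationStrategy};

    fn main() {
        // Placeholder vocabulary path, downloaded manually beforehand.
        let tokenizer = BertTokenizer::from_file("bert-base-uncased-vocab.txt", true, true)
            .expect("failed to load vocabulary");

        // WordPiece tokenizers implement MultiThreadedTokenizer, so the batch
        // below can be encoded in parallel across worker threads.
        let batch = ["First sentence.", "Second sentence.", "Third sentence."];
        let outputs = MultiThreadedTokenizer::encode_list(
            &tokenizer,
            &batch,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        for encoded in outputs {
            println!("{:?}", encoded.token_ids);
        }
    }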

Quick Start & Requirements

  • Rust Usage: Requires Rust and a manual download of tokenizer vocabulary/merge files, such as those distributed with the Hugging Face Transformers library (a hedged sketch follows this list).
  • Python Usage: Requires the Rust nightly toolchain. Install by running python setup.py install from the /python-bindings directory.
  • Dependencies: The Python bindings require PyTorch and Hugging Face's transformers library.
  • Documentation: Usage examples are available in the /tests folder.
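
For a BPE model such as GPT-2, both a vocabulary file and a merges file are needed. A minimal sketch, assuming the files were downloaded manually and that the constructor matches the current crate version:

    use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer, TruncationStrategy};

    fn main() {
        // Placeholder paths: vocab.json and merges.txt fetched manually from
        // the gpt2 model repository on the Hugging Face hub.
        let tokenizer = Gpt2Tokenizer::from_file("gpt2-vocab.json", "gpt2-merges.txt", false)
            .expect("failed to load vocabulary/merges");

        let encoded = tokenizer.encode(
            "Byte-pair encoding at work",
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0,
        );
        println!("{:?}", encoded.token_ids);
    }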

Highlighted Details

  • Supports a wide range of transformer architectures including BERT, GPT, RoBERTa, DeBERTa, and more.
  • Offers WordPiece, BPE, and Unigram (SentencePiece) tokenization algorithms.
  • Includes Python bindings for easier integration with existing ML workflows.
  • Provides multi-threaded processing for WordPiece tokenizers.

Maintenance & Community

The project is maintained by its author, guillaume-be. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library requires manually downloading tokenizer vocabulary and merge files. The Python bindings require the Rust nightly toolchain, which may not be suitable for production environments. The README does not provide performance benchmarks or comparisons against other tokenization libraries.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

tokenmonster by alasdairforsythe

Subword tokenizer and vocabulary trainer for multiple languages
594 stars · 0.7% · created 2 years ago · updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization
10k stars · 0.2% · created 1 year ago · updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.