rust-tokenizers by guillaume-be

Rust library for high-performance tokenization in modern language models

Created 5 years ago
326 stars

Top 83.5% on SourcePulse

Project Summary

This library provides high-performance tokenizers for modern language models, including WordPiece, BPE, and Unigram (SentencePiece) algorithms. It targets researchers and developers working with state-of-the-art transformer architectures, offering efficient tokenization for models like BERT, GPT, RoBERTa, and more.

How It Works

The library leverages Rust for its performance benefits, implementing several tokenization strategies. WordPiece tokenizers support both single-threaded and multi-threaded processing, while BPE tokenizers use a shared cache and run single-threaded. The goal is faster tokenization than pure Python implementations.
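To make the WordPiece strategy mentioned above concrete, here is a minimal std-only sketch of the greedy longest-match-first algorithm that BERT-style tokenizers use. The vocabulary, function name, and return convention are illustrative assumptions for this sketch, not the crate's actual API:

```rust
use std::collections::HashSet;

// Greedy longest-match-first WordPiece (toy sketch, not the crate's API).
// Returns None when a word cannot be fully segmented; real tokenizers
// would emit an [UNK] token instead.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest candidate first, shrinking from the right.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece); // continuation-piece prefix
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(p) => {
                pieces.push(p);
                start = end;
            }
            None => return None, // no vocabulary entry matches this span
        }
    }
    Some(pieces)
}

fn main() {
    // Toy stand-in for a real vocab file downloaded from Hugging Face.
    let vocab: HashSet<&str> =
        ["un", "##aff", "##able", "play", "##ing"].into_iter().collect();
    println!("{:?}", wordpiece("unaffable", &vocab));
    // → Some(["un", "##aff", "##able"])
}
```

The multi-threaded mode described above parallelizes this per-word loop across input sequences; the segmentation of each word is independent, so no shared mutable state is needed.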

Quick Start & Requirements

  • Rust Usage: Requires Rust and manual download of tokenizer vocabulary/merge files from the Hugging Face Transformers library.
  • Python Usage: Requires a Rust nightly build. Installation involves running python setup.py install within the /python-bindings directory after setting up the Rust nightly toolchain.
  • Dependencies: Python bindings require PyTorch and Hugging Face's transformers library.
  • Documentation: Usage examples are available in the /tests folder.

Highlighted Details

  • Supports a wide range of transformer architectures including BERT, GPT, RoBERTa, DeBERTa, and more.
  • Offers both WordPiece and BPE tokenization algorithms.
  • Includes Python bindings for easier integration with existing ML workflows.
  • Provides multi-threaded processing for WordPiece tokenizers.
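The BPE algorithm listed above works by repeatedly merging the highest-priority adjacent symbol pair according to a learned merge table (the "merge files" mentioned in Quick Start). A minimal std-only sketch, where the merge table and function names are assumptions for illustration rather than the crate's API:

```rust
use std::collections::HashMap;

// Byte-pair-encoding segmentation (toy sketch, not the crate's API):
// start from single characters and repeatedly apply the highest-priority
// merge (lowest rank in the merge table) until none applies.
fn bpe(word: &str, merges: &[(String, String)]) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    // Earlier entries in the merge file have higher priority (lower rank).
    let ranks: HashMap<(String, String), usize> = merges
        .iter()
        .enumerate()
        .map(|(i, (a, b))| ((a.clone(), b.clone()), i))
        .collect();
    loop {
        // Find the adjacent pair with the lowest rank.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|r| (*r, i)))
            .min();
        let Some((_, i)) = best else { break };
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols.splice(i..i + 2, [merged]);
    }
    symbols
}

fn main() {
    // Toy merge table; real tables are learned from a corpus.
    let merges = vec![
        ("l".to_string(), "o".to_string()),
        ("lo".to_string(), "w".to_string()),
    ];
    println!("{:?}", bpe("lower", &merges));
    // → ["low", "e", "r"]
}
```

Because the result for a given word depends only on the merge table, a shared cache of already-segmented words (as the summary notes the BPE tokenizers use) avoids recomputing this loop for repeated words.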

Maintenance & Community

The project is maintained by Guillaume Be. Further community or roadmap information is not explicitly detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The library requires manually downloading tokenizer vocabulary and merge files. The Python bindings require a Rust nightly toolchain, which may not be stable enough for production environments. The README does not provide performance benchmarks or comparisons against other tokenization libraries.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
