rustbpe by karpathy: Efficient Rust library for BPE tokenizer training
Summary
rustbpe is a high-performance Rust library for training Byte Pair Encoding (BPE) tokenizers, designed to complement the tiktoken inference library. tiktoken has no training capability, and the other libraries that do are more complex; rustbpe fills that gap with a streamlined way to create custom GPT-style tokenizers. Tokenizers are trained in Rust and then exported for fast inference.
How It Works
The core of rustbpe is a BPE training algorithm implemented in Rust, leveraging parallel processing via the rayon crate for speed. It defaults to the GPT-4 style regex pattern for pre-tokenization and includes Python bindings generated using PyO3. This approach allows users to train tokenizers rapidly and then export the resulting vocabulary and patterns directly into a format compatible with tiktoken, facilitating a fast training-to-inference workflow.
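To make the train-then-export flow concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is an illustration under stated assumptions, not rustbpe's implementation or API: the names train_bpe and to_tiktoken_lines are hypothetical, a plain whitespace split stands in for the GPT-4 style regex, and there is no parallelism. The export helper writes one base64-encoded token and its rank per line, which is the text layout tiktoken's BPE files use.

```python
import base64
from collections import Counter


def train_bpe(text: str, vocab_size: int) -> dict[bytes, int]:
    """Train a toy byte-level BPE and return a token-bytes -> rank mapping."""
    assert vocab_size >= 256, "the 256 single-byte tokens are always present"
    # Assumption: a whitespace split replaces the regex pre-tokenizer.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in text.split()
    )
    ranks = {bytes([i]): i for i in range(256)}

    while len(ranks) < vocab_size:
        # Count adjacent token pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Most frequent pair; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merged = best[0] + best[1]
        if merged not in ranks:  # two different pairs can join to the same bytes
            ranks[merged] = len(ranks)

        # Rewrite every word, replacing occurrences of the pair left to right.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return ranks


def to_tiktoken_lines(ranks: dict[bytes, int]) -> list[str]:
    """Serialize ranks as '<base64 token> <rank>' lines, ordered by rank."""
    ordered = sorted(ranks.items(), key=lambda kv: kv[1])
    return [f"{base64.b64encode(tok).decode('ascii')} {rank}" for tok, rank in ordered]


if __name__ == "__main__":
    ranks = train_bpe("low low low lower lowest", vocab_size=258)
    # Show only the learned merges, which follow the 256 byte tokens.
    print("\n".join(to_tiktoken_lines(ranks)[256:]))
```

The pair-counting loop is the part that a parallel implementation would spread across threads, since each pre-tokenized chunk can be counted independently before the totals are combined.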
Quick Start & Requirements
Installation is straightforward via pip: pip install rustbpe. For development or building from source, clone the repository, set up a virtual environment (e.g., using uv), and install maturin (uv pip install maturin). Then, run maturin develop --release to build the Python bindings. Prerequisites include Rust (installable via rustup.rs) and uv.
Highlighted Details
Parallel training (via rayon).
Export to the tiktoken format for efficient inference.
Maintenance & Community
The project is primarily authored by Andrej Karpathy, who acknowledges limited Rust development background and significant assistance from LLMs (ChatGPT, Claude) in writing the Rust code. While all equality tests pass, the author explicitly invites community feedback via GitHub Issues and Pull Requests regarding Rust code structure, idiomatic practices, and implementation quality.
Licensing & Compatibility
rustbpe is released under the permissive MIT license, which permits broad usage, including integration into commercial and closed-source applications, provided the copyright and license notice are retained.
Limitations & Caveats
Because the author reports limited Rust expertise and relied on LLMs for code generation, the Rust implementation may not be fully idiomatic or optimally structured. Users who need production-grade Rust code should review the implementation before depending on it, and are invited to contribute to code review.