rustbpe by karpathy

Efficient Rust library for BPE tokenizer training

Created 1 week ago


285 stars

Top 91.9% on SourcePulse

Project Summary

Summary

rustbpe provides a high-performance Rust library for training Byte Pair Encoding (BPE) tokenizers, specifically designed to complement the tiktoken inference library. It addresses the lack of training capabilities in tiktoken and the complexity of other libraries, offering a streamlined solution for users needing to create custom GPT-style tokenizers efficiently. The library enables training BPE models in Rust and exporting them for fast inference.

How It Works

The core of rustbpe is a BPE training algorithm implemented in Rust, leveraging parallel processing via the rayon crate for speed. It defaults to the GPT-4 style regex pattern for pre-tokenization and includes Python bindings generated using PyO3. This approach allows users to train tokenizers rapidly and then export the resulting vocabulary and patterns directly into a format compatible with tiktoken, facilitating a fast training-to-inference workflow.
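At its core, BPE training is the classic loop: count adjacent token pairs, merge the most frequent pair into a new token, and repeat until the vocabulary is full. A minimal pure-Python sketch of that loop (illustrating the algorithm only, not rustbpe's actual Rust implementation, which adds regex pre-tokenization and rayon-parallel counting):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.

    Starts from raw UTF-8 bytes (ids 0..255); each merge mints a new id.
    """
    ids = list(text.encode("utf-8"))
    merges = {}          # (left_id, right_id) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break        # no pair occurs more than once; stop early
        merges[(a, b)] = next_id
        # Rewrite the sequence, replacing each (a, b) occurrence with next_id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges

merges = train_bpe("low lower lowest", 3)
```

Real trainers apply this loop per pre-tokenized chunk (so merges never cross the regex boundaries) and parallelize the pair counting, which is where the Rust/rayon implementation earns its speed.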

Quick Start & Requirements

Installation is straightforward via pip: pip install rustbpe. For development or building from source, clone the repository, set up a virtual environment (e.g., using uv), and install maturin (uv pip install maturin). Then, run maturin develop --release to build the Python bindings. Prerequisites include Rust (installable via rustup.rs) and uv.
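Collected into commands, the setup above looks roughly like the following (the clone URL assumes the author's GitHub namespace; build from source only if you need the development workflow):

```shell
# Simple install from PyPI
pip install rustbpe

# Or: build the Python bindings from source
git clone https://github.com/karpathy/rustbpe.git
cd rustbpe
uv venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release   # compiles the Rust crate and installs it into the venv
```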

Highlighted Details

  • Features fast BPE training utilizing parallel processing (rayon).
  • Includes GPT-4 style regex pre-tokenization by default, with support for custom patterns.
  • Enables direct export of trained tokenizers to the tiktoken format for efficient inference.
  • Provides Python bindings through PyO3, allowing seamless integration into Python workflows.
  • Supports parallel batch encoding for improved throughput.

Maintenance & Community

The project is primarily authored by Andrej Karpathy, who acknowledges limited Rust development background and significant assistance from LLMs (ChatGPT, Claude) in writing the Rust code. While all equality tests pass, the author explicitly invites community feedback via GitHub Issues and Pull Requests regarding Rust code structure, idiomatic practices, and implementation quality.

Licensing & Compatibility

rustbpe is released under the permissive MIT license. This license permits broad usage, including integration into commercial and closed-source applications without significant restrictions.

Limitations & Caveats

The author's self-professed limited Rust expertise and reliance on LLMs for code generation mean the Rust implementation may not be fully idiomatic or optimally structured. Users requiring robust, production-grade Rust code should exercise caution and contribute to code review.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 1
  • Star History: 286 stars in the last 7 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Lewis Tunstall (Research Engineer at Hugging Face), and 15 more.

torchtune by meta-pytorch (0.2% on SourcePulse, 6k stars)

PyTorch library for LLM post-training and experimentation. Created 2 years ago; updated 1 day ago.