Minimal BPE encoder/decoder for LLM tokenization
This repository provides a minimal, clean, and hackable implementation of the Byte Pair Encoding (BPE) algorithm, essential for modern Large Language Model (LLM) tokenization. It's designed for researchers and developers who need to understand, customize, or integrate BPE tokenization into their projects, offering both a basic BPE and a GPT-4 compatible version.
How It Works
The library implements BPE by operating directly on UTF-8 encoded strings. It starts with individual bytes as tokens (0-255) and iteratively merges the most frequent adjacent pair of tokens to build a vocabulary. Two main tokenizers are provided: `BasicTokenizer`, which runs BPE directly on the text, and `RegexTokenizer`, which first splits the text with a regex pattern to preserve semantic boundaries before merging, mirroring the GPT-2 and GPT-4 approaches. A `GPT4Tokenizer` class specifically replicates GPT-4's tokenization using the `tiktoken` library's patterns.
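The core training loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of the algorithm, not the library's actual code; the helper names `get_pair_counts` and `merge` are chosen here for clarity:

```python
def get_pair_counts(ids):
    """Count occurrences of each adjacent token pair."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # tokens 0-255 are the raw bytes
merges = {}
for step in range(3):             # vocab_size 256 + 3 means 3 merges
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)  # most frequent adjacent pair
    new_id = 256 + step
    merges[pair] = new_id               # record the merge rule
    ids = merge(ids, pair, new_id)      # apply it to the sequence
```

After three merges the 11-byte input compresses to 5 tokens, and `merges` holds the learned rules that an encoder replays (in order) on new text.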
Quick Start & Requirements
To use the basic functionality:
```python
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", 256 + 3)
print(tokenizer.encode("aaabdaaabac"))
```
To verify GPT-4 compatibility, `tiktoken` must be installed (`pip install tiktoken`). No specific hardware or OS requirements are mentioned beyond a standard Python environment.
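The regex preprocessing used by `RegexTokenizer` can be illustrated with a simplified split pattern. The real GPT-2/GPT-4 patterns use the third-party `regex` module with Unicode character classes; this ASCII-only stdlib approximation exists purely for illustration:

```python
import re

# Simplified stand-in for a GPT-style split pattern: common contractions,
# letter runs, digit runs, punctuation runs, and whitespace.
SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+"
)

chunks = SPLIT_PATTERN.findall("Hello world's 123 tokens!!")
# Each chunk is BPE-merged independently, so merges never cross these
# boundaries (e.g. a word never fuses with the punctuation after it).
```

Here the text splits into `['Hello', " world", "'s", ' 123', ' tokens', '!!']`, keeping words, numbers, and punctuation in separate chunks before any merging happens.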
Highlighted Details
The `GPT4Tokenizer` supports special tokens (e.g. `<|endoftext|>`), which are mapped to single dedicated token ids rather than being split by the BPE merges.
Maintenance & Community
The project is maintained by Andrej Karpathy. A Rust implementation (minbpe-rs
) is available, and a step-by-step guide for building BPE is provided in exercise.md
. A YouTube lecture explaining the code is also linked.
Licensing & Compatibility
The project is released under the MIT License, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
The README mentions a "TODO" for a more optimized Python version for large files and vocabs, and a C or Rust version. It also notes potential future work to support GPT-2/3/3.5 and replicate SentencePiece for Llama.