minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

created 1 year ago
9,786 stars

Top 5.2% on sourcepulse

View on GitHub
Project Summary

This repository provides a minimal, clean, and hackable implementation of the Byte Pair Encoding (BPE) algorithm, which underlies modern Large Language Model (LLM) tokenization. It is aimed at researchers and developers who need to understand, customize, or integrate BPE tokenization into their projects, and it offers both a basic BPE tokenizer and a GPT-4 compatible one.

How It Works

The library implements BPE directly on the bytes of UTF-8 encoded text. It starts with the 256 individual byte values as the initial tokens (ids 0-255) and iteratively merges the most frequent adjacent pair of tokens to grow the vocabulary. Two main tokenizers are provided: BasicTokenizer, which runs BPE directly on the text, and RegexTokenizer, which first splits the text with a regex pattern to preserve semantic boundaries (words, numbers, punctuation) before merging, mirroring the GPT-2 and GPT-4 approaches. A GPT4Tokenizer class specifically reproduces GPT-4's tokenization, matching the tiktoken library's patterns.
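To make the training loop concrete, here is a minimal sketch of byte-level BPE in plain Python. This is illustrative only, not minbpe's actual code; the merge helper and variable names are hypothetical, and tie-breaking between equally frequent pairs may differ from the library's.

from collections import Counter

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes (tokens 0-255)
merges = {}
for step in range(3):                      # 3 merges grow the vocab from 256 to 259
    counts = Counter(zip(ids, ids[1:]))    # frequency of each adjacent token pair
    pair = max(counts, key=counts.get)     # most frequent pair
    merges[pair] = new_id = 256 + step     # assign it the next token id
    ids = merge(ids, pair, new_id)
print(ids)  # e.g. [258, 100, 258, 97, 99]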

Quick Start & Requirements

To use the basic functionality:

from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", 256 + 3)  # 256 byte tokens, then 3 merges
print(tokenizer.encode("aaabdaaabac"))   # [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))  # "aaabdaaabac"

To verify GPT-4 compatibility, tiktoken must be installed (pip install tiktoken). No hardware or OS requirements are mentioned beyond a standard Python environment.
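A quick sanity check along the lines of the README's example compares GPT4Tokenizer against tiktoken's cl100k_base encoding (the one GPT-4 uses); the sample text here is illustrative:

import tiktoken
from minbpe import GPT4Tokenizer

text = "hello world!!! how are you?"
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding in tiktoken
tokenizer = GPT4Tokenizer()
# Both should produce identical token ids.
print(enc.encode(text))
print(tokenizer.encode(text))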

Highlighted Details

  • Offers exact GPT-4 tokenization replication via GPT4Tokenizer.
  • Allows training custom tokenizers from scratch on user-provided text.
  • Supports registering and handling special tokens (e.g., <|endoftext|>); see the sketch after this list.
  • Code is intentionally kept short, commented, and hackable for educational purposes.
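For special tokens, the README shows a flow along these lines. This is a sketch: the vocab size and token id are illustrative, and the training file path assumes the sample text shipped in the repo's tests/ directory.

from minbpe import RegexTokenizer

training_text = open("tests/taylorswift.txt", encoding="utf-8").read()
tokenizer = RegexTokenizer()
tokenizer.train(training_text, vocab_size=1024)             # 256 bytes + 768 merges
tokenizer.register_special_tokens({"<|endoftext|>": 1024})  # id just past the learned vocab
print(tokenizer.encode("<|endoftext|>hello world", allowed_special="all"))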

Maintenance & Community

The project is maintained by Andrej Karpathy. A Rust implementation (minbpe-rs) is available, and a step-by-step guide for building BPE is provided in exercise.md. A YouTube lecture explaining the code is also linked.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The README lists TODOs for a more optimized Python version that can handle large files and big vocabularies, and for a C or Rust port. It also notes possible future work to support the GPT-2/3/3.5 tokenizers and to replicate SentencePiece training for Llama.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 223 stars in the last 90 days

Explore Similar Projects

Starred by Simon Willison (co-creator of Django), Jared Palmer (ex-VP of AI at Vercel; founder of Turborepo; author of Formik and TSDX), and 1 more.

GPT-3-Encoder by latitudegames
719 stars
JS library for GPT-2/GPT-3 text tokenization
created 4 years ago, updated 2 years ago