minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

created 1 year ago
9,786 stars

Top 5.2% on sourcepulse

View on GitHub
Project Summary

This repository provides a minimal, clean, and hackable implementation of the Byte Pair Encoding (BPE) algorithm, which underlies modern Large Language Model (LLM) tokenization. It is aimed at researchers and developers who need to understand, customize, or integrate BPE tokenization into their projects, and it offers both a basic BPE tokenizer and a GPT-4 compatible one.

How It Works

The library implements BPE directly on the bytes of UTF-8 encoded text. It starts with the 256 individual byte values as the initial tokens (ids 0-255) and iteratively merges the most frequent adjacent pair of tokens to grow the vocabulary. Two main tokenizers are provided: BasicTokenizer, which runs BPE directly on the text, and RegexTokenizer, which first splits the text with a regex pattern to preserve semantic boundaries (words, numbers, punctuation) before merging, mirroring the GPT-2 and GPT-4 approaches. A GPT4Tokenizer class specifically reproduces GPT-4's tokenization, matching the tiktoken library's patterns.
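To make the training loop concrete, here is a minimal sketch of byte-level BPE in plain Python. This is illustrative only, not minbpe's actual code; the merge helper and variable names are hypothetical, and tie-breaking between equally frequent pairs may differ from the library's.

from collections import Counter

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes (tokens 0-255)
merges = {}
for step in range(3):                      # 3 merges grow the vocab from 256 to 259
    counts = Counter(zip(ids, ids[1:]))    # frequency of each adjacent token pair
    pair = max(counts, key=counts.get)     # most frequent pair
    merges[pair] = new_id = 256 + step     # assign it the next token id
    ids = merge(ids, pair, new_id)
print(ids)  # e.g. [258, 100, 258, 97, 99]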

Quick Start & Requirements

To use the basic functionality:

from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", 256 + 3)  # 256 byte tokens, then 3 merges
print(tokenizer.encode("aaabdaaabac"))   # [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))  # "aaabdaaabac"

To verify GPT-4 compatibility, tiktoken must be installed (pip install tiktoken). No hardware or OS requirements are mentioned beyond a standard Python environment.
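A quick sanity check along the lines of the README's example compares GPT4Tokenizer against tiktoken's cl100k_base encoding (the one GPT-4 uses); the sample text here is illustrative:

import tiktoken
from minbpe import GPT4Tokenizer

text = "hello world!!! how are you?"
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding in tiktoken
tokenizer = GPT4Tokenizer()
# Both should produce identical token ids.
print(enc.encode(text))
print(tokenizer.encode(text))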

Highlighted Details

  • Offers exact GPT-4 tokenization replication via GPT4Tokenizer.
  • Allows training custom tokenizers from scratch on user-provided text.
  • Supports registering and handling special tokens (e.g., <|endoftext|>); see the sketch after this list.
  • Code is intentionally kept short, commented, and hackable for educational purposes.
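For special tokens, the README shows a flow along these lines. This is a sketch: the vocab size and token id are illustrative, and the training file path assumes the sample text shipped in the repo's tests/ directory.

from minbpe import RegexTokenizer

training_text = open("tests/taylorswift.txt", encoding="utf-8").read()
tokenizer = RegexTokenizer()
tokenizer.train(training_text, vocab_size=1024)             # 256 bytes + 768 merges
tokenizer.register_special_tokens({"<|endoftext|>": 1024})  # id just past the learned vocab
print(tokenizer.encode("<|endoftext|>hello world", allowed_special="all"))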

Maintenance & Community

The project is maintained by Andrej Karpathy. A Rust implementation (minbpe-rs) is available, and a step-by-step guide for building BPE is provided in exercise.md. A YouTube lecture explaining the code is also linked.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The README lists TODOs for a more optimized Python version that can handle large files and big vocabularies, and for a C or Rust port. It also notes possible future work to support the GPT-2/3/3.5 tokenizers and to replicate SentencePiece training for Llama.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 223 stars in the last 90 days

Explore Similar Projects

Starred by Simon Willison (co-creator of Django), Jared Palmer (ex-VP of AI at Vercel; founder of Turborepo; author of Formik and TSDX), and 1 more.

GPT-3-Encoder by latitudegames
719 stars
JS library for GPT-2/GPT-3 text tokenization
created 4 years ago, updated 2 years ago