tokenmonster by alasdairforsythe

Subword tokenizer and vocabulary trainer for multiple languages

created 2 years ago
594 stars

Top 55.6% on sourcepulse

Project Summary

TokenMonster offers an "ungreedy" subword tokenization algorithm and vocabulary trainer designed to improve the efficiency and performance of large language models. It targets developers and researchers working with LLMs, enabling faster inference, lower computational cost, and effectively longer context windows by representing the same text with fewer tokens.

How It Works

TokenMonster employs a distillation-inspired training process that starts from every possible token and iteratively prunes down to a target vocabulary size. Unlike Byte-Pair Encoding (BPE), which builds its vocabulary through greedy pair merges and tokenizes by greedy longest-match, TokenMonster's "ungreedy" tokenizer weighs alternative tokenization branches before committing to a token and selects the one that yields the better overall segmentation. Combined with vocabularies generated specifically for a given dataset and tokenization algorithm, this aims to produce significantly more efficient token representations.
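
As a rough illustration of the difference (a toy sketch only, not TokenMonster's actual code), compare a greedy longest-match tokenizer with one that weighs alternative branches before committing:

    # Toy vocabulary and input chosen so that greedy longest-match backfires.
    vocab = {"swim", "swimming", "mingle"}

    def greedy_tokenize(text):
        # Always take the longest matching token at the current position.
        tokens, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in vocab:
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                tokens.append(text[i])  # fall back to a single character
                i += 1
        return tokens

    def ungreedy_tokenize(text):
        # Brute-force every branch and keep the segmentation with the fewest
        # tokens; a real trainer prunes rather than enumerating like this.
        if not text:
            return []
        best = None
        for j in range(1, len(text) + 1):
            piece = text[:j]
            if piece in vocab or j == 1:
                candidate = [piece] + ungreedy_tokenize(text[j:])
                if best is None or len(candidate) < len(best):
                    best = candidate
        return best

    print(greedy_tokenize("swimmingle"))    # ['swimming', 'l', 'e'] -> 3 tokens
    print(ungreedy_tokenize("swimmingle"))  # ['swim', 'mingle'] -> 2 tokens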

Quick Start & Requirements

  • Install via pip: pip install tokenmonster
  • Usage example (a round-trip check follows this list):
    import tokenmonster
    vocab = tokenmonster.load("englishcode-32000-consistent-v1")
    tokens = vocab.tokenize("This is a test.")
    
  • Pretrained vocabularies can be loaded by name; they are downloaded automatically.
  • Official documentation and a browser-based tester are available.
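
As a quick sanity check, tokens can be decoded back to the original text. The round-trip below assumes the vocab.decode() method exposed by the Python package; consult the official documentation if your installed version differs:

    import tokenmonster

    vocab = tokenmonster.load("englishcode-32000-consistent-v1")

    tokens = vocab.tokenize("This is a test.")
    print(len(tokens), "tokens")   # fewer tokens means better compression
    print(vocab.decode(tokens))    # should print: This is a test.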

Highlighted Details

  • Claims to outperform other tokenization algorithms in efficiency and speed.
  • Offers 5 optimization modes (unfiltered, clean, balanced, consistent, strict) and "capcode", a marker-based encoding for capitalization that keeps the vocabulary itself lowercase.
  • Provides 442 pretrained vocabularies covering several datasets (code, English, English+code, fiction) and a range of sizes; see the loading sketch after this list.
  • Supports Python, Go, and JavaScript implementations for tokenization and detokenization.
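
Judging by the name used in the quick start, pretrained vocabularies appear to follow a <dataset>-<size>-<mode>-v1 naming pattern. The second name below is an assumed example for illustration; verify names against the project's published vocabulary list before use:

    import tokenmonster

    for name in (
        "englishcode-32000-consistent-v1",  # name from the quick start above
        "english-24000-balanced-v1",        # assumed name, for illustration only
    ):
        vocab = tokenmonster.load(name)     # pretrained vocabularies download on first use
        n = len(vocab.tokenize("def greet(): return 'Hi!'"))
        print(name, "->", n, "tokens")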

Maintenance & Community

  • Support is available via the "Discussions" tab on the GitHub repository.
  • Paid consultation services are offered for custom vocabulary generation.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README notes that some pretrained vocabularies are still being trained, so availability should be checked before relying on a specific one.
  • While aiming for broad compatibility, specific integration details with various LLM frameworks are not detailed.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 22 stars in the last 90 days
