tokenmonster by alasdairforsythe

Subword tokenizer and vocabulary trainer for multiple languages

created 2 years ago
594 stars

Top 55.6% on sourcepulse

Project Summary

TokenMonster offers an "ungreedy" subword tokenization algorithm and vocabulary trainer designed to improve the efficiency and performance of large language models. It targets developers and researchers working with LLMs, enabling faster inference, lower computational cost, and effectively longer context windows by representing the same text with fewer tokens.

How It Works

TokenMonster employs a distillation-inspired training process that starts from every possible token and iteratively prunes down to a target vocabulary size. Unlike Byte-Pair Encoding (BPE), which builds its vocabulary through greedy pair merges and tokenizes by greedy longest-match, TokenMonster's "ungreedy" tokenizer weighs alternative tokenization branches before committing to a token and selects the one that yields the better overall segmentation. Combined with vocabularies generated specifically for a given dataset and tokenization algorithm, this aims to produce significantly more efficient token representations.
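
As a rough illustration of the difference (a toy sketch only, not TokenMonster's actual code), compare a greedy longest-match tokenizer with one that weighs alternative branches before committing:

    # Toy vocabulary and input chosen so that greedy longest-match backfires.
    vocab = {"swim", "swimming", "mingle"}

    def greedy_tokenize(text):
        # Always take the longest matching token at the current position.
        tokens, i = [], 0
        while i < len(text):
            for j in range(len(text), i, -1):
                if text[i:j] in vocab:
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                tokens.append(text[i])  # fall back to a single character
                i += 1
        return tokens

    def ungreedy_tokenize(text):
        # Brute-force every branch and keep the segmentation with the fewest
        # tokens; a real trainer prunes rather than enumerating like this.
        if not text:
            return []
        best = None
        for j in range(1, len(text) + 1):
            piece = text[:j]
            if piece in vocab or j == 1:
                candidate = [piece] + ungreedy_tokenize(text[j:])
                if best is None or len(candidate) < len(best):
                    best = candidate
        return best

    print(greedy_tokenize("swimmingle"))    # ['swimming', 'l', 'e'] -> 3 tokens
    print(ungreedy_tokenize("swimmingle"))  # ['swim', 'mingle'] -> 2 tokens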

Quick Start & Requirements

  • Install via pip: pip install tokenmonster
  • Usage example (a round-trip check follows this list):
    import tokenmonster
    vocab = tokenmonster.load("englishcode-32000-consistent-v1")
    tokens = vocab.tokenize("This is a test.")
    
  • Pretrained vocabularies can be loaded by name; they are downloaded automatically.
  • Official documentation and a browser-based tester are available.
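
As a quick sanity check, tokens can be decoded back to the original text. The round-trip below assumes the vocab.decode() method exposed by the Python package; consult the official documentation if your installed version differs:

    import tokenmonster

    vocab = tokenmonster.load("englishcode-32000-consistent-v1")

    tokens = vocab.tokenize("This is a test.")
    print(len(tokens), "tokens")   # fewer tokens means better compression
    print(vocab.decode(tokens))    # should print: This is a test.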

Highlighted Details

  • Claims to outperform other tokenization algorithms in efficiency and speed.
  • Offers 5 optimization modes (unfiltered, clean, balanced, consistent, strict) and "capcode", a marker-based encoding for capitalization that keeps the vocabulary itself lowercase.
  • Provides 442 pretrained vocabularies covering several datasets (code, English, English+code, fiction) and a range of sizes; see the loading sketch after this list.
  • Supports Python, Go, and JavaScript implementations for tokenization and detokenization.
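
Judging by the name used in the quick start, pretrained vocabularies appear to follow a <dataset>-<size>-<mode>-v1 naming pattern. The second name below is an assumed example for illustration; verify names against the project's published vocabulary list before use:

    import tokenmonster

    for name in (
        "englishcode-32000-consistent-v1",  # name from the quick start above
        "english-24000-balanced-v1",        # assumed name, for illustration only
    ):
        vocab = tokenmonster.load(name)     # pretrained vocabularies download on first use
        n = len(vocab.tokenize("def greet(): return 'Hi!'"))
        print(name, "->", n, "tokens")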

Maintenance & Community

  • Support is available via the "Discussions" tab on the GitHub repository.
  • Paid consultation services are offered for custom vocabulary generation.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README notes that some pretrained vocabularies are still being trained, so availability should be checked before relying on a specific one.
  • While aiming for broad compatibility, specific integration details with various LLM frameworks are not detailed.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 22 stars in the last 90 days
