M4THYOU/TokenDagger: Fast tokenization for large-scale text processing
Top 65.6% on SourcePulse
TokenDagger is a high-performance, drop-in replacement for OpenAI's tiktoken library, aimed at developers and researchers working with large-scale text processing. It targets significantly faster tokenization and lower memory consumption than existing solutions.
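Because TokenDagger is described as a drop-in replacement, a common integration pattern is a guarded import that prefers it and falls back to tiktoken. This is a hedged sketch: the importable module name `tokendagger` and full API compatibility are assumptions taken from this summary, so verify them against the repository before relying on it.

```python
import importlib

def load_tokenizer_backend(preferred=("tokendagger", "tiktoken")):
    """Return (name, module) for the first importable backend, else (None, None).

    "tokendagger" as the module name is an assumption; the README describes
    the project as a drop-in replacement for tiktoken, so downstream code
    can use whichever backend this returns through the same API.
    """
    for name in preferred:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None

name, backend = load_tokenizer_backend()
print(name or "no tokenizer backend installed")
```

Since both libraries would expose the same interface under this assumption, the rest of the application only needs the returned module, not a hard dependency on either package.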
How It Works
TokenDagger leverages an optimized PCRE2 regex engine for efficient token pattern matching and a simplified Byte Pair Encoding (BPE) algorithm to minimize the performance impact of large special token vocabularies. This approach allows for faster processing, particularly on code samples, and reduced memory usage, outperforming Hugging Face's batch tokenizer in memory-intensive scenarios.
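The two ideas above — regex pre-tokenization followed by byte-pair merging — can be illustrated with a minimal pure-Python sketch. TokenDagger itself implements this in native code with PCRE2; the pattern and the tiny merge table below are toy assumptions, not the library's real vocabulary, and real BPE merges the lowest-ranked pair first rather than the first match found.

```python
import re

# 1. Pre-tokenize: split text into word-like chunks with a regex
#    (production tokenizers use a far richer pattern than this).
PRETOKEN_RE = re.compile(r"\s*\S+")

# 2. Toy merge table standing in for a learned BPE vocabulary.
MERGES = {("l", "o"): "lo", ("lo", "w"): "low"}

def bpe_encode(text: str) -> list[str]:
    """Greedy BPE over each regex chunk; merges first matching pair for brevity."""
    tokens: list[str] = []
    for chunk in PRETOKEN_RE.findall(text):
        symbols = list(chunk)
        merged = True
        while merged and len(symbols) > 1:
            merged = False
            for i in range(len(symbols) - 1):
                if (symbols[i], symbols[i + 1]) in MERGES:
                    symbols[i:i + 2] = [MERGES[(symbols[i], symbols[i + 1])]]
                    merged = True
                    break
        tokens.extend(symbols)
    return tokens

print(bpe_encode("slow"))  # → ['s', 'low']
```

The performance-sensitive parts are exactly the two stages shown: the regex scan (where TokenDagger swaps in PCRE2) and the repeated pair-merge loop (where its simplified BPE reduces the cost of large special-token vocabularies).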
Quick Start & Requirements
Install from PyPI with pip install tokendagger, or build from source: git clone git@github.com:M4THYOU/TokenDagger.git, then git submodule update --init --recursive. Building requires the PCRE2 development headers (libpcre2-dev on Debian/Ubuntu) and python3-dev. TokenDagger serves as a drop-in replacement for tiktoken.
Highlighted Details
Tested with the meta-llama/Llama-4-Scout-17B-16E-Instruct and mistralai/Mistral-8B-Instruct-2410 tokenizers.
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. This requires further investigation for commercial or closed-source integration.
Limitations & Caveats
The project's license is not specified, which may pose a barrier to commercial use. Benchmarks were conducted on specific hardware (AMD EPYC), and performance may vary on different architectures.
Last updated: 5 months ago (Inactive)