Fast tokenization for large-scale text processing
Top 67.3% on SourcePulse
TokenDagger offers a high-performance, drop-in replacement for OpenAI's TikToken library, targeting developers and researchers working with large-scale text processing. It aims to significantly improve tokenization speed and reduce memory consumption compared to existing solutions.
How It Works
TokenDagger leverages an optimized PCRE2 regex engine for efficient token pattern matching and a simplified Byte Pair Encoding (BPE) algorithm to minimize the performance impact of large special token vocabularies. This approach allows for faster processing, particularly on code samples, and reduced memory usage, outperforming Hugging Face's batch tokenizer in memory-intensive scenarios.
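To make the two-stage design concrete, here is a minimal, self-contained sketch of the general approach: a regex pre-tokenization pass splits text into chunks, then a greedy BPE loop merges adjacent symbols by learned rank. This is a generic illustration of the algorithm family TokenDagger optimizes, not its actual implementation; the function name bpe_encode, the toy merges table, and the simplistic regex are all invented for this example (real tokenizers use far more elaborate patterns and vocabularies).

```python
import re

def bpe_encode(text, merges):
    """Greedy BPE over regex-split chunks.

    `merges` maps a pair of symbols to its merge rank (lower rank =
    applied earlier). Hypothetical helper for illustration only.
    """
    # Pre-tokenization: split text into word-like and whitespace chunks.
    # This stands in for the PCRE2-driven pattern matching described above.
    chunks = re.findall(r"\S+|\s+", text)
    tokens = []
    for chunk in chunks:
        symbols = list(chunk)  # start from individual characters
        while len(symbols) > 1:
            # Find the adjacent pair with the best (lowest) merge rank.
            pairs = [(merges.get((a, b), float("inf")), i)
                     for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
            rank, i = min(pairs)
            if rank == float("inf"):
                break  # no learned merge applies to any remaining pair
            symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
        tokens.extend(symbols)
    return tokens

# Toy merge table: 'l'+'o' merges first, then 'lo'+'w'.
merges = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_encode("low low", merges))  # ['low', ' ', 'low']
```

A fast implementation spends most of its time in exactly these two loops, which is why an optimized regex engine and a simplified merge strategy for special tokens pay off.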
Quick Start & Requirements
Install from PyPI:

pip install tokendagger

Or build from source:

git clone git@github.com:M4THYOU/TokenDagger.git, then git submodule update --init --recursive

Building from source requires libpcre2-dev (on Debian/Ubuntu) and python3-dev. TokenDagger works with standard .tiktoken vocabulary files.
Highlighted Details
Validated against the meta-llama/Llama-4-Scout-17B-16E-Instruct and mistralai/Mistral-8B-Instruct-2410 tokenizers.
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. This requires further investigation for commercial or closed-source integration.
Limitations & Caveats
The project's license is not specified, which may pose a barrier to commercial use. Benchmarks were conducted on specific hardware (AMD EPYC), and performance may vary on different architectures.