M4THYOU/TokenDagger: Fast tokenization for large-scale text processing
Top 65.6% on SourcePulse
TokenDagger is a high-performance, drop-in replacement for OpenAI's tiktoken library, aimed at developers and researchers working with large-scale text processing. It targets significantly faster tokenization and lower memory consumption than existing solutions.
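Because TokenDagger is described as a drop-in replacement, a common integration pattern is a guarded import that prefers it and falls back to tiktoken. This is a hedged sketch: the importable module name `tokendagger` and full API compatibility are assumptions taken from this summary, so verify them against the repository before relying on it.

```python
import importlib

def load_tokenizer_backend(preferred=("tokendagger", "tiktoken")):
    """Return (name, module) for the first importable backend, else (None, None).

    "tokendagger" as the module name is an assumption; the README describes
    the project as a drop-in replacement for tiktoken, so downstream code
    can use whichever backend this returns through the same API.
    """
    for name in preferred:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None

name, backend = load_tokenizer_backend()
print(name or "no tokenizer backend installed")
```

Since both libraries would expose the same interface under this assumption, the rest of the application only needs the returned module, not a hard dependency on either package.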
How It Works
TokenDagger leverages an optimized PCRE2 regex engine for efficient token pattern matching and a simplified Byte Pair Encoding (BPE) algorithm to minimize the performance impact of large special token vocabularies. This approach allows for faster processing, particularly on code samples, and reduced memory usage, outperforming Hugging Face's batch tokenizer in memory-intensive scenarios.
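The two ideas above — regex pre-tokenization followed by byte-pair merging — can be illustrated with a minimal pure-Python sketch. TokenDagger itself implements this in native code with PCRE2; the pattern and the tiny merge table below are toy assumptions, not the library's real vocabulary, and real BPE merges the lowest-ranked pair first rather than the first match found.

```python
import re

# 1. Pre-tokenize: split text into word-like chunks with a regex
#    (production tokenizers use a far richer pattern than this).
PRETOKEN_RE = re.compile(r"\s*\S+")

# 2. Toy merge table standing in for a learned BPE vocabulary.
MERGES = {("l", "o"): "lo", ("lo", "w"): "low"}

def bpe_encode(text: str) -> list[str]:
    """Greedy BPE over each regex chunk; merges first matching pair for brevity."""
    tokens: list[str] = []
    for chunk in PRETOKEN_RE.findall(text):
        symbols = list(chunk)
        merged = True
        while merged and len(symbols) > 1:
            merged = False
            for i in range(len(symbols) - 1):
                if (symbols[i], symbols[i + 1]) in MERGES:
                    symbols[i:i + 2] = [MERGES[(symbols[i], symbols[i + 1])]]
                    merged = True
                    break
        tokens.extend(symbols)
    return tokens

print(bpe_encode("slow"))  # → ['s', 'low']
```

The performance-sensitive parts are exactly the two stages shown: the regex scan (where TokenDagger swaps in PCRE2) and the repeated pair-merge loop (where its simplified BPE reduces the cost of large special-token vocabularies).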
Quick Start & Requirements
Install from PyPI with pip install tokendagger, or build from source: git clone git@github.com:M4THYOU/TokenDagger.git, then git submodule update --init --recursive. Building requires the PCRE2 development headers (libpcre2-dev on Debian/Ubuntu) and python3-dev. TokenDagger serves as a drop-in replacement for tiktoken.
Highlighted Details
Tested with the meta-llama/Llama-4-Scout-17B-16E-Instruct and mistralai/Mistral-8B-Instruct-2410 tokenizers.
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. This requires further investigation for commercial or closed-source integration.
Limitations & Caveats
The project's license is not specified, which may pose a barrier to commercial use. Benchmarks were conducted on specific hardware (AMD EPYC), and performance may vary on different architectures.
Last updated: 5 months ago (Inactive)