TokenDagger by M4THYOU

Fast tokenization for large-scale text processing

Created 2 months ago
446 stars

Top 67.3% on SourcePulse

Project Summary

TokenDagger is a high-performance, drop-in replacement for OpenAI's TikToken library, aimed at developers and researchers who process text at large scale. Its goal is faster tokenization than TikToken and lower memory consumption than Hugging Face's batch tokenizer.
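One way to adopt a drop-in replacement cautiously is to prefer it when it is installed and fall back to the original library otherwise. The sketch below is illustrative only: `first_importable` is a hypothetical helper, and it assumes the importable module shares the pip package name `tokendagger`, which this summary does not confirm.

```python
import importlib


def first_importable(names):
    """Return the first module in `names` that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or failed to import): try the next candidate.
            continue
    return None


# Assumption: the module name matches the pip package name "tokendagger".
# If neither library is installed, tokenizer_lib is None.
tokenizer_lib = first_importable(["tokendagger", "tiktoken"])
```

Whether the two libraries expose identical call signatures should be checked against the README before relying on this pattern.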

How It Works

TokenDagger uses an optimized PCRE2 regex engine for token pattern matching and a simplified Byte Pair Encoding (BPE) algorithm that limits the performance cost of large special-token vocabularies. The project reports that this makes processing faster, particularly on code samples, and reduces memory usage enough to outperform Hugging Face's batch tokenizer in memory-intensive scenarios.
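The two stages named above, regex pre-tokenization followed by BPE merging, can be illustrated with a small textbook sketch. This is not TokenDagger's implementation: it uses Python's built-in `re` instead of PCRE2, a deliberately simple pattern, a hypothetical three-entry merge table (`TOY_RANKS`), and it returns byte strings rather than integer token ids.

```python
import re

# Toy pre-tokenization pattern: runs of letters, runs of digits, runs of
# whitespace, or any other single character. Production patterns are richer.
PRETOKENIZE = re.compile(r"[A-Za-z]+|[0-9]+|\s+|[^A-Za-z0-9\s]")


def bpe_merge(piece: bytes, ranks: dict) -> list:
    """Repeatedly merge the adjacent pair with the lowest rank."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair appears in the merge table
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


def encode(text: str, ranks: dict) -> list:
    """Split text with the regex, then apply BPE merges within each piece."""
    tokens = []
    for match in PRETOKENIZE.finditer(text):
        tokens.extend(bpe_merge(match.group().encode("utf-8"), ranks))
    return tokens


# Hypothetical merge table: a lower rank means the pair is merged earlier.
TOY_RANKS = {b"lo": 0, b"low": 1, b"er": 2}

print(encode("lower low", TOY_RANKS))
```

Because merges never cross the boundaries produced by the regex, the speed of the pattern-matching stage directly affects overall tokenization speed, which is why the choice of regex engine matters.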

Quick Start & Requirements

  • Install via pip: pip install tokendagger
  • For development: run git clone git@github.com:M4THYOU/TokenDagger.git, then git submodule update --init --recursive inside the cloned repository.
  • Prerequisites: PCRE2 and the Python development headers (on Debian/Ubuntu, install libpcre2-dev and python3-dev).
  • Testing requires tiktoken.
  • Official quick-start and usage examples are available in the README.
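Since testing depends on tiktoken, a natural check is parity: both libraries should produce the same token ids for the same input. The sketch below shows the shape of such a check under stated assumptions. `find_mismatches` is a hypothetical helper, and the two toy encoders stand in for the real libraries' encode functions, whose exact signatures are not given in this summary.

```python
from typing import Callable, Iterable, List


def find_mismatches(
    reference_encode: Callable[[str], List[int]],
    candidate_encode: Callable[[str], List[int]],
    texts: Iterable[str],
) -> List[str]:
    """Return the texts for which the two encoders produce different ids."""
    return [t for t in texts if reference_encode(t) != candidate_encode(t)]


# Toy stand-ins: in practice these would wrap the two tokenizer libraries.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def codepoint_encode(text: str) -> List[int]:
    return [ord(c) for c in text]


# The encoders agree on ASCII but differ on the non-ASCII sample.
print(find_mismatches(byte_encode, codepoint_encode, ["abc", "é"]))
```

An empty result over a representative corpus (prose, code, and non-ASCII text) is reasonable evidence that a replacement tokenizer is behaving as a drop-in.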

Highlighted Details

  • Claims 2x the throughput of TikToken overall, and 4x faster tokenization of code samples.
  • Benchmarks show significantly lower memory usage than Hugging Face's batch tokenizer.
  • Supports the meta-llama/Llama-4-Scout-17B-16E-Instruct and mistralai/Ministral-8B-Instruct-2410 tokenizers.
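Throughput and memory claims like these are easiest to evaluate by re-measuring on your own hardware and data. The following is a minimal sketch of such a measurement, not the project's benchmark harness: `throughput_bytes_per_s` and `peak_python_memory` are hypothetical helpers, and `toy_encode` stands in for a real tokenizer's encode function.

```python
import time
import tracemalloc


def throughput_bytes_per_s(encode, texts):
    """Encode every text once and return UTF-8 bytes processed per second."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    start = time.perf_counter()
    for t in texts:
        encode(t)
    elapsed = time.perf_counter() - start
    return total_bytes / max(elapsed, 1e-9)  # guard against a zero interval


def peak_python_memory(encode, texts):
    """Return peak bytes allocated through Python's allocator while encoding.

    Caveat: tracemalloc does not see memory that a native extension
    allocates outside Python's allocator, so for C/C++-backed tokenizers
    process-level measurements (such as resident set size) are also needed.
    """
    tracemalloc.start()
    results = [encode(t) for t in texts]  # keep results alive, like a batch
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del results
    return peak


# Toy stand-in for a tokenizer's encode function.
def toy_encode(text):
    return list(text.encode("utf-8"))


samples = ["def add(a, b):\n    return a + b\n"] * 1000
print(throughput_bytes_per_s(toy_encode, samples))
print(peak_python_memory(toy_encode, samples))
```

Timing and memory tracing are kept in separate functions because tracemalloc adds overhead that would distort the throughput figure.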

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (like Discord/Slack) is provided in the README.

Licensing & Compatibility

The README does not explicitly state a license, so the licensing terms should be confirmed before any commercial or closed-source integration.

Limitations & Caveats

Because no license is specified, commercial use may be blocked until one is confirmed. The benchmarks were run on AMD EPYC hardware, so performance may differ on other architectures.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 9 stars in the last 30 days
