tiktoken-rs by zurawiki

Rust tokenizer library for GPT models and tiktoken

created 2 years ago
326 stars

Top 84.8% on sourcepulse

Project Summary

This Rust library provides a high-performance, thread-safe implementation of OpenAI's tiktoken tokenizer, designed for developers and researchers working with large language models. It efficiently encodes text into token IDs and decodes them back, which is crucial for managing context windows and API costs.
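Counting tokens before sending a request is the usual way to budget a context window. A minimal sketch using the crate's cl100k_base() constructor and encode_with_special_tokens method; the prompt text is illustrative:

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    // Load the cl100k_base encoding (used by GPT-3.5 and GPT-4).
    let bpe = cl100k_base().expect("failed to load cl100k_base");

    // The number of token IDs is what counts against the model's
    // context window and your API bill.
    let prompt = "How many tokens does this prompt use?";
    let tokens = bpe.encode_with_special_tokens(prompt);
    println!("{} tokens", tokens.len());
}
```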

How It Works

The library leverages Rust's performance and memory safety to deliver a fast, reliable tokenizer. It implements the tiktoken algorithm directly, including support for encodings such as cl100k_base, used by GPT-3.5 and GPT-4. This native implementation avoids the overhead of cross-language bindings, making it well suited to performance-critical applications.
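Because each model family uses a different encoding, the crate also exposes a helper that resolves an encoder from a model name. A sketch using get_bpe_from_model; the model names are illustrative and support depends on the installed version:

```rust
use tiktoken_rs::get_bpe_from_model;

fn main() {
    for model in ["gpt-4", "gpt-3.5-turbo", "text-davinci-003"] {
        // Maps a model name to its encoding, e.g. cl100k_base for
        // GPT-4 and p50k_base for text-davinci-003.
        let bpe = get_bpe_from_model(model).expect("unknown model");
        let n = bpe.encode_with_special_tokens("hello world").len();
        println!("{model}: {n} tokens");
    }
}
```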

Quick Start & Requirements

  • Install via cargo add tiktoken-rs (a minimal round-trip sketch follows this list).
  • Requires a Rust toolchain (stable or nightly).
  • See the official documentation for detailed usage.
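A minimal encode/decode round trip after running cargo add tiktoken-rs; this sketch assumes the CoreBPE API of recent releases (encode_with_special_tokens and decode):

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    let bpe = cl100k_base().expect("failed to load encoding");

    // Encode text into token IDs...
    let tokens = bpe.encode_with_special_tokens("The quick brown fox");
    println!("token ids: {tokens:?}");

    // ...and decode the IDs back into the original string.
    let text = bpe.decode(tokens).expect("failed to decode");
    assert_eq!(text, "The quick brown fox");
}
```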

Highlighted Details

  • Thread-safe and performant Rust implementation of tiktoken (see the threading sketch after this list).
  • Supports multiple encoding types (cl100k_base, p50k_base, r50k_base).
  • Efficiently counts tokens without full encoding.
  • Low-level API for direct control.
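As a sketch of the thread-safety claim, a single encoder can be shared across worker threads behind an Arc; this assumes CoreBPE is Send + Sync, which that claim implies:

```rust
use std::sync::Arc;
use std::thread;

use tiktoken_rs::cl100k_base;

fn main() {
    // One shared encoder, no per-thread copies.
    let bpe = Arc::new(cl100k_base().expect("failed to load encoding"));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let bpe = Arc::clone(&bpe);
            thread::spawn(move || {
                // Each worker encodes independently against the shared BPE.
                let text = format!("worker {i} says hello");
                bpe.encode_with_special_tokens(&text).len()
            })
        })
        .collect();

    for handle in handles {
        println!("{} tokens", handle.join().unwrap());
    }
}
```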

Maintenance & Community

The project is maintained by zurawiki; community discussion and support happen through GitHub issues.

Licensing & Compatibility

Licensed under the MIT license, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

As an independent implementation, the library may lag behind updates to the official tiktoken package, such as encodings for newly released models. It focuses on core tokenization and does not include higher-level text-processing utilities.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

  • 10k stars, top 0.2% on sourcepulse
  • created 1 year ago, updated 1 year ago