tiktoken-rs by zurawiki

Rust tokenizer library for GPT models and tiktoken

created 2 years ago
326 stars

Top 84.8% on sourcepulse

Project Summary

This Rust library provides a high-performance, thread-safe implementation of OpenAI's tiktoken tokenizer, designed for developers and researchers working with large language models. It efficiently encodes text into token IDs and decodes them back, which is crucial for managing context windows and API costs.
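Counting tokens before sending a request is the usual way to budget a context window. A minimal sketch using the crate's cl100k_base() constructor and encode_with_special_tokens method; the prompt text is illustrative:

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    // Load the cl100k_base encoding (used by GPT-3.5 and GPT-4).
    let bpe = cl100k_base().expect("failed to load cl100k_base");

    // The number of token IDs is what counts against the model's
    // context window and your API bill.
    let prompt = "How many tokens does this prompt use?";
    let tokens = bpe.encode_with_special_tokens(prompt);
    println!("{} tokens", tokens.len());
}
```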

How It Works

The library leverages Rust's performance and memory safety to deliver a fast, reliable tokenizer. It implements the tiktoken algorithm directly, including support for encodings such as cl100k_base, used by GPT-3.5 and GPT-4. This native implementation avoids the overhead of cross-language bindings, making it well suited to performance-critical applications.
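Because each model family uses a different encoding, the crate also exposes a helper that resolves an encoder from a model name. A sketch using get_bpe_from_model; the model names are illustrative and support depends on the installed version:

```rust
use tiktoken_rs::get_bpe_from_model;

fn main() {
    for model in ["gpt-4", "gpt-3.5-turbo", "text-davinci-003"] {
        // Maps a model name to its encoding, e.g. cl100k_base for
        // GPT-4 and p50k_base for text-davinci-003.
        let bpe = get_bpe_from_model(model).expect("unknown model");
        let n = bpe.encode_with_special_tokens("hello world").len();
        println!("{model}: {n} tokens");
    }
}
```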

Quick Start & Requirements

  • Install via cargo add tiktoken-rs (a minimal round-trip sketch follows this list).
  • Requires a Rust toolchain (stable or nightly).
  • See the official documentation for detailed usage.
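A minimal encode/decode round trip after running cargo add tiktoken-rs; this sketch assumes the CoreBPE API of recent releases (encode_with_special_tokens and decode):

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    let bpe = cl100k_base().expect("failed to load encoding");

    // Encode text into token IDs...
    let tokens = bpe.encode_with_special_tokens("The quick brown fox");
    println!("token ids: {tokens:?}");

    // ...and decode the IDs back into the original string.
    let text = bpe.decode(tokens).expect("failed to decode");
    assert_eq!(text, "The quick brown fox");
}
```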

Highlighted Details

  • Thread-safe and performant Rust implementation of tiktoken (see the threading sketch after this list).
  • Supports multiple encoding types (cl100k_base, p50k_base, r50k_base).
  • Efficiently counts tokens without full encoding.
  • Low-level API for direct control.
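As a sketch of the thread-safety claim, a single encoder can be shared across worker threads behind an Arc; this assumes CoreBPE is Send + Sync, which that claim implies:

```rust
use std::sync::Arc;
use std::thread;

use tiktoken_rs::cl100k_base;

fn main() {
    // One shared encoder, no per-thread copies.
    let bpe = Arc::new(cl100k_base().expect("failed to load encoding"));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let bpe = Arc::clone(&bpe);
            thread::spawn(move || {
                // Each worker encodes independently against the shared BPE.
                let text = format!("worker {i} says hello");
                bpe.encode_with_special_tokens(&text).len()
            })
        })
        .collect();

    for handle in handles {
        println!("{} tokens", handle.join().unwrap());
    }
}
```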

Maintenance & Community

The project is maintained by zurawiki; community discussion and support happen through GitHub issues.

Licensing & Compatibility

Licensed under the MIT license, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

As an independent implementation, the library may lag behind updates to the official tiktoken package, such as encodings for newly released models. It focuses on core tokenization and does not include higher-level text-processing utilities.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

  • 10k stars, top 0.2% on sourcepulse
  • created 1 year ago, updated 1 year ago