tiktoken by openai

Fast BPE tokenizer for OpenAI models

created 2 years ago
15,261 stars

Top 3.3% on sourcepulse

Project Summary

tiktoken is a high-performance Byte Pair Encoding (BPE) tokenizer designed for efficient text processing with OpenAI's language models. It offers a significant speedup over other open-source tokenizers, making it ideal for developers and researchers working with large text datasets or requiring rapid tokenization for API interactions.

How It Works

tiktoken implements BPE, a reversible and lossless text compression algorithm that converts text into numerical tokens. This method is advantageous as it handles arbitrary text, compresses data by representing common subwords as single tokens, and aids models in understanding grammar by recognizing recurring word parts. The library provides direct access to OpenAI's specific encodings and includes an educational submodule for learning BPE mechanics.
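The core BPE training step described above, repeatedly merging the most frequent adjacent pair of tokens into a new token, can be sketched in a few lines of plain Python. This is an illustrative toy, not tiktoken's implementation (tiktoken's fast path is written in Rust), and the function names are invented for this sketch:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes, so any input text is representable.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))

# One training step: merge the most frequent pair into a new token.
# Ids 0-255 are reserved for single bytes, so the first merge gets id 256.
pair = most_frequent_pair(ids)
ids = merge(ids, pair, 256)
print(ids)
```

Because every merge is recorded, the process is reversible: expanding each merged token back into its pair, and finally each byte id back into its byte, recovers the original text exactly.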

Quick Start & Requirements

  • Install via pip: pip install tiktoken
  • Requires Python.
  • Official documentation: tiktoken/core.py
  • Example code: OpenAI Cookbook

Highlighted Details

  • 3-6x faster than comparable open-source tokenizers.
  • Supports OpenAI's model-specific encodings (e.g., gpt-4o).
  • Includes an educational submodule for BPE visualization and training.
  • Extensible via a plugin mechanism (tiktoken_ext) for custom encodings.

Maintenance & Community

  • Primarily maintained by OpenAI.
  • Questions and support are handled via the issue tracker.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library's primary focus is on OpenAI's models; custom encoding registration requires careful adherence to the namespace package structure and setup.py configuration.
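A sketch of what such a plugin module might look like, modeled on how tiktoken's own `tiktoken_ext/openai_public.py` registers encodings. The encoding name, pattern, and ranks below are toy values, not a real vocabulary, and `setup.py` must list `tiktoken_ext` as a package without adding an `__init__.py`, so it remains a namespace package tiktoken can scan:

```python
# tiktoken_ext/my_encodings.py
# (the tiktoken_ext directory must NOT contain an __init__.py)

def my_encoding():
    # Hypothetical toy encoding: 256 byte-level tokens plus one special token.
    return {
        "name": "my_encoding",
        "pat_str": r"\S+|\s+",  # regex used to split text before BPE
        "mergeable_ranks": {bytes([i]): i for i in range(256)},
        "special_tokens": {"<|endoftext|>": 256},
    }

# tiktoken discovers plugins by scanning tiktoken_ext submodules
# for a dict with this exact name.
ENCODING_CONSTRUCTORS = {"my_encoding": my_encoding}
```

Once such a package is installed, `tiktoken.get_encoding("my_encoding")` should be able to construct the custom encoding by name.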

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 6
  • Star history: 979 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

  • Top 0.2% · 10k stars
  • created 1 year ago · updated 1 year ago