tiktoken by openai

Fast BPE tokenizer for OpenAI models

Created 2 years ago
15,946 stars

Top 3.0% on SourcePulse

Project Summary

tiktoken is a fast Byte Pair Encoding (BPE) tokenizer for use with OpenAI's language models. It is reported to be 3-6x faster than comparable open-source tokenizers, which makes it a good fit for developers and researchers who process large text datasets or need to tokenize text quickly before calling the API.

How It Works

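tiktoken implements BPE, which converts text into a sequence of integer token IDs. The encoding is reversible and lossless, so decoding the tokens returns the original text exactly. BPE has three further advantages: it handles arbitrary text, including text unlike the data the tokenizer was built from; it compresses the input, because common subwords are represented as single tokens and the token sequence is shorter than the underlying bytes; and it exposes recurring word parts (for example, the "ing" in "encoding") to the model, which helps it generalize and pick up grammar. The library provides direct access to OpenAI's specific encodings and includes an educational submodule for learning BPE mechanics.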
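
To make the mechanics concrete, here is a minimal byte-level BPE sketch in pure Python. It is not tiktoken's implementation: the function names (train_bpe, merge_pair, encode, decode) are invented for illustration, and it skips the regex pre-splitting and Rust core that the real library uses. It only shows the core idea of merging frequent adjacent pairs and the lossless round trip.

```python
from collections import Counter


def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merges; merge number k creates token id 256 + k."""
    tokens = list(data)  # start from raw bytes, ids 0..255
    merges: list[tuple[int, int]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:  # nothing repeats, so further merges would not compress
            break
        tokens = merge_pair(tokens, best, 256 + len(merges))
        merges.append(best)
    return merges


def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Apply the learned merges, in the order they were learned, to new text."""
    tokens = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        tokens = merge_pair(tokens, pair, 256 + rank)
    return tokens


def decode(tokens: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand each token id back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


merges = train_bpe("low lower lowest low low".encode("utf-8"), num_merges=10)
ids = encode("lowest flow", merges)
assert decode(ids, merges) == "lowest flow"  # reversible and lossless
```

Because every single byte keeps its own token id, text that never appeared in the training sample still encodes and decodes correctly; it simply compresses less.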

Quick Start & Requirements

  • Install via pip: pip install tiktoken
  • Requires Python.
  • Official documentation: tiktoken/core.py
  • Example code: OpenAI Cookbook

Highlighted Details

  • 3-6x faster than comparable open-source tokenizers.
  • Supports the encodings used by OpenAI's models (e.g., o200k_base for gpt-4o), which can be looked up by encoding name or by model name.
  • Includes an educational submodule for BPE visualization and training.
  • Extensible via a plugin mechanism (tiktoken_ext) for custom encodings.

Maintenance & Community

  • Primarily maintained by OpenAI.
  • Questions and support are handled via the issue tracker.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library's primary focus is on OpenAI's models; custom encoding registration requires careful adherence to the namespace package structure and setup.py configuration.
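
A hedged sketch of what such a plugin module might look like follows. The package name my_tiktoken_extension, the module name my_encodings.py, and the encoding name my_byte_level are all hypothetical, and the toy vocabulary (one token per byte, no merges) exists only to keep the example small. The parts taken from the library's documented mechanism are the tiktoken_ext namespace package without an __init__.py file and the ENCODING_CONSTRUCTORS dictionary, which maps each encoding name to a zero-argument function returning keyword arguments for tiktoken.Encoding.

```python
# Hypothetical file: my_tiktoken_extension/tiktoken_ext/my_encodings.py
#
# Assumed project layout (no tiktoken_ext/__init__.py, so that tiktoken_ext
# remains a namespace package shared with tiktoken's own plugins):
#
#   my_tiktoken_extension/
#   ├── tiktoken_ext/
#   │   └── my_encodings.py
#   └── setup.py   # uses find_namespace_packages(include=["tiktoken_ext*"])


def my_byte_level() -> dict:
    """Return keyword arguments for tiktoken.Encoding (illustrative values only)."""
    # One rank per raw byte, so any input can be encoded; no merges are defined.
    mergeable_ranks = {bytes([b]): b for b in range(256)}
    return {
        "name": "my_byte_level",
        "pat_str": r"\s+|\S+",  # assumed pre-split pattern: whitespace runs or word runs
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 256},
    }


# tiktoken scans modules under tiktoken_ext for this exact variable name.
ENCODING_CONSTRUCTORS = {"my_byte_level": my_byte_level}
```

Under these assumptions, installing the package (for example with pip install ./my_tiktoken_extension) should make the encoding available through tiktoken.get_encoding("my_byte_level").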

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 9
  • Issues (30d): 5
  • Star history: 363 stars in the last 30 days
