tiktoken by openai

Fast BPE tokenizer for OpenAI models

created 2 years ago
15,261 stars

Top 3.3% on sourcepulse

Project Summary

tiktoken is a high-performance Byte Pair Encoding (BPE) tokenizer designed for efficient text processing with OpenAI's language models. It offers a significant speedup over other open-source tokenizers, making it ideal for developers and researchers working with large text datasets or requiring rapid tokenization for API interactions.

How It Works

tiktoken implements BPE, a reversible and lossless text compression algorithm that converts text into numerical tokens. This method is advantageous as it handles arbitrary text, compresses data by representing common subwords as single tokens, and aids models in understanding grammar by recognizing recurring word parts. The library provides direct access to OpenAI's specific encodings and includes an educational submodule for learning BPE mechanics.
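The core BPE training step described above, repeatedly merging the most frequent adjacent pair of tokens into a new token, can be sketched in a few lines of plain Python. This is an illustrative toy, not tiktoken's implementation (tiktoken's fast path is written in Rust), and the function names are invented for this sketch:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes, so any input text is representable.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))

# One training step: merge the most frequent pair into a new token.
# Ids 0-255 are reserved for single bytes, so the first merge gets id 256.
pair = most_frequent_pair(ids)
ids = merge(ids, pair, 256)
print(ids)
```

Because every merge is recorded, the process is reversible: expanding each merged token back into its pair, and finally each byte id back into its byte, recovers the original text exactly.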

Quick Start & Requirements

  • Install via pip: pip install tiktoken
  • Requires Python.
  • Official documentation: tiktoken/core.py
  • Example code: OpenAI Cookbook

Highlighted Details

  • 3-6x faster than comparable open-source tokenizers.
  • Supports OpenAI's model-specific encodings (e.g., gpt-4o).
  • Includes an educational submodule for BPE visualization and training.
  • Extensible via a plugin mechanism (tiktoken_ext) for custom encodings.

Maintenance & Community

  • Primarily maintained by OpenAI.
  • Questions and support are handled via the issue tracker.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The library's primary focus is on OpenAI's models; custom encoding registration requires careful adherence to the namespace package structure and setup.py configuration.
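A sketch of what such a plugin module might look like, modeled on how tiktoken's own `tiktoken_ext/openai_public.py` registers encodings. The encoding name, pattern, and ranks below are toy values, not a real vocabulary, and `setup.py` must list `tiktoken_ext` as a package without adding an `__init__.py`, so it remains a namespace package tiktoken can scan:

```python
# tiktoken_ext/my_encodings.py
# (the tiktoken_ext directory must NOT contain an __init__.py)

def my_encoding():
    # Hypothetical toy encoding: 256 byte-level tokens plus one special token.
    return {
        "name": "my_encoding",
        "pat_str": r"\S+|\s+",  # regex used to split text before BPE
        "mergeable_ranks": {bytes([i]): i for i in range(256)},
        "special_tokens": {"<|endoftext|>": 256},
    }

# tiktoken discovers plugins by scanning tiktoken_ext submodules
# for a dict with this exact name.
ENCODING_CONSTRUCTORS = {"my_encoding": my_encoding}
```

Once such a package is installed, `tiktoken.get_encoding("my_encoding")` should be able to construct the custom encoding by name.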

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 6
  • Star history: 979 stars in the last 90 days

Explore Similar Projects

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

Minimal BPE encoder/decoder for LLM tokenization

  • Top 0.2% · 10k stars
  • created 1 year ago · updated 1 year ago