gpt-tokenizer by niieani

JS library for OpenAI GPT model token encoding/decoding

created 2 years ago
596 stars

Top 55.4% on sourcepulse

Project Summary

This library provides the fastest JavaScript Byte Pair Encoding (BPE) tokenizer for OpenAI's GPT models, designed for developers working with LLMs in JavaScript environments. It offers a performant, low-footprint solution for encoding and decoding text to and from tokens, supporting all current OpenAI models and offering advanced features like chat tokenization and asynchronous streaming.

How It Works

The library is a direct port of OpenAI's tiktoken library, implemented in TypeScript for type safety and performance. It utilizes BPE algorithms to convert text into integer token sequences, mirroring OpenAI's official tokenization. Key advantages include synchronous operation, generator functions for streaming, efficient isWithinTokenLimit checks, and a memory-efficient design without global caches.
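The BPE idea described above can be sketched in a few lines. This is a toy illustration of the merge loop, not the library's actual implementation (real tokenizers use a pretrained merge table rather than learning merges from the input):

```typescript
// Toy BPE sketch: repeatedly replace the most frequent adjacent pair of
// tokens with a fresh token id, until no pair occurs more than once.
function bpeEncode(text: string, maxMerges = 10): number[] {
  // Start from raw character codes as the initial token sequence.
  let tokens: number[] = Array.from(text, (ch) => ch.charCodeAt(0));
  let nextId = 256; // ids below 256 are reserved for single bytes

  for (let merge = 0; merge < maxMerges; merge++) {
    // Count every adjacent pair in the current sequence.
    const counts = new Map<string, number>();
    for (let i = 0; i < tokens.length - 1; i++) {
      const key = `${tokens[i]},${tokens[i + 1]}`;
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    // Pick the most frequent pair (must occur at least twice).
    let best: string | null = null;
    let bestCount = 1;
    for (const [key, count] of counts) {
      if (count > bestCount) { best = key; bestCount = count; }
    }
    if (best === null) break; // no repeating pair left: done
    const [a, b] = best.split(",").map(Number);
    // Replace each occurrence of the pair with the new token id.
    const merged: number[] = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        merged.push(nextId);
        i++; // skip the second element of the merged pair
      } else {
        merged.push(tokens[i]);
      }
    }
    tokens = merged;
    nextId++;
  }
  return tokens;
}

const out = bpeEncode("aaabdaaabac");
console.log(out.length); // 5 (down from 11 input characters)
```

Decoding reverses the process: each merged id expands back into the pair it replaced, so the original text is always recoverable.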

Quick Start & Requirements

  • Install via npm: npm install gpt-tokenizer
  • Usage examples and detailed API documentation are available in the README.
  • A live playground is accessible at: https://gpt-tokenizer.dev/

Highlighted Details

  • Supports all OpenAI models, including GPT-4o, GPT-4, and GPT-3.5-turbo, with various encodings (o200k_base, cl100k_base, etc.).
  • Offers encodeChat for efficient chat message tokenization.
  • Provides asynchronous generator functions (decodeAsyncGenerator) for streaming token processing.
  • Includes an isWithinTokenLimit function for quick token count checks without full encoding.
  • Benchmarked as the fastest tokenizer published on NPM, outperforming WASM-based and native node-binding alternatives.
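The early-exit idea behind a token-limit check can be sketched with a lazy generator: tokens are produced one at a time, and the check stops consuming as soon as the limit is exceeded, so long inputs are never fully tokenized. This is a toy sketch of the pattern (the names `tokenizeLazily` and `isWithinLimit` are hypothetical, and the whitespace split stands in for real BPE):

```typescript
// Generator stand-in for a streaming tokenizer: yields one token at a time.
function* tokenizeLazily(text: string): Generator<string> {
  for (const word of text.split(/\s+/).filter(Boolean)) {
    yield word;
  }
}

// Early-exit limit check: stops pulling tokens the moment the limit is hit,
// instead of encoding the whole input and counting afterwards.
function isWithinLimit(text: string, limit: number): boolean {
  let count = 0;
  for (const _ of tokenizeLazily(text)) {
    count++;
    if (count > limit) return false; // abandon tokenization here
  }
  return true;
}

console.log(isWithinLimit("one two three", 5)); // true
console.log(isWithinLimit("one two three", 2)); // false
```

The same generator shape is what makes streaming decode APIs possible: a consumer can render tokens as they arrive rather than waiting for the full sequence.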

Maintenance & Community

  • Trusted by major organizations like Microsoft (Teams) and Elastic (Kibana).
  • Contributions are welcome via pull requests or issues.
  • Discussions are available for ideas and inquiries.

Licensing & Compatibility

  • MIT License.
  • Fully compatible with commercial and closed-source applications.

Limitations & Caveats

  • By default, all special tokens are disallowed during encoding; custom handling is required via EncodeOptions.
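The default-disallow behavior can be illustrated with a toy check (the `SPECIAL_TOKENS` set and `checkSpecialTokens` helper below are hypothetical, not the library's API): encoding input that contains a special token fails unless that token has been explicitly allowed.

```typescript
// Toy illustration of disallowed-by-default special tokens.
const SPECIAL_TOKENS = new Set(["<|endoftext|>", "<|im_start|>", "<|im_end|>"]);

// Throws if the text contains a special token that was not explicitly allowed.
function checkSpecialTokens(text: string, allowed: Set<string> = new Set()): void {
  for (const special of SPECIAL_TOKENS) {
    if (text.includes(special) && !allowed.has(special)) {
      throw new Error(`disallowed special token: ${special}`);
    }
  }
}

// Default behavior: the check rejects the input.
let threw = false;
try {
  checkSpecialTokens("hello <|endoftext|>");
} catch {
  threw = true;
}
console.log(threw); // true

// Explicitly allowing the token lets the same input pass.
checkSpecialTokens("hello <|endoftext|>", new Set(["<|endoftext|>"]));
```

Rejecting special tokens by default guards against prompt injection: untrusted input cannot smuggle in control tokens unless the caller opts in.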

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History

  • 34 stars in the last 90 days

Explore Similar Projects

Starred by Simon Willison (Author of Django), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 1 more.

GPT-3-Encoder by latitudegames

  • 719 stars
  • JS library for GPT-2/GPT-3 text tokenization
  • created 4 years ago, updated 2 years ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

  • 10k stars
  • Minimal BPE encoder/decoder for LLM tokenization
  • created 1 year ago, updated 1 year ago