gpt-tokenizer by niieani

JS library for OpenAI GPT model token encoding/decoding

Created 2 years ago
620 stars

Top 53.2% on SourcePulse

Project Summary

This library provides the fastest JavaScript Byte Pair Encoding (BPE) tokenizer for OpenAI's GPT models, designed for developers working with LLMs in JavaScript environments. It offers a performant, low-footprint solution for encoding and decoding text to and from tokens, supporting all current OpenAI models and offering advanced features like chat tokenization and asynchronous streaming.

How It Works

The library is a direct port of OpenAI's tiktoken library, implemented in TypeScript for type safety and performance. It utilizes BPE algorithms to convert text into integer token sequences, mirroring OpenAI's official tokenization. Key advantages include synchronous operation, generator functions for streaming, efficient isWithinTokenLimit checks, and a memory-efficient design without global caches.
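The BPE merge loop at the heart of such tokenizers can be sketched as follows. This is an illustrative toy with a made-up vocabulary and merge ranks, not gpt-tokenizer's actual tables or API:

```typescript
// Toy BPE encoder: repeatedly merge the adjacent pair with the best
// (lowest) merge rank until no mergeable pair remains, then map the
// resulting symbols to integer token ids.
// The ranks and vocabulary below are invented for illustration only.
const ranks = new Map<string, number>([
  ["l+o", 0],
  ["lo+w", 1],
]);
const vocab = new Map<string, number>([
  ["low", 256],
  ["e", 101],
  ["r", 114],
]);

function bpeEncode(word: string): number[] {
  // Start from individual characters.
  let parts = [...word];
  for (;;) {
    // Find the adjacent pair with the lowest merge rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(parts[i] + "+" + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best === -1) break; // no more merges apply
    // Merge the winning pair into one symbol.
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  // Map each final symbol to its token id (toy fallback: -1 for unknown).
  return parts.map((p) => vocab.get(p) ?? -1);
}

console.log(bpeEncode("lower")); // [256, 101, 114]
```

The real encoder works over byte sequences and a full rank table, but the merge-until-stable loop is the same idea.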

Quick Start & Requirements

  • Install via npm: npm install gpt-tokenizer
  • Usage examples and detailed API documentation are available in the README.
  • A live playground is accessible at: https://gpt-tokenizer.dev/

Highlighted Details

  • Supports all OpenAI models, including GPT-4o, GPT-4, and GPT-3.5-turbo, with various encodings (o200k_base, cl100k_base, etc.).
  • Offers encodeChat for efficient chat message tokenization.
  • Provides asynchronous generator functions (decodeAsyncGenerator) for streaming token processing.
  • Includes an isWithinTokenLimit function for quick token count checks without full encoding.
  • Benchmarked as the fastest tokenizer on NPM, outperforming WASM/node bindings.
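The early-exit check that isWithinTokenLimit describes can be sketched like this. It is a hypothetical standalone illustration layered on a stand-in tokenizer generator, not the library's internals:

```typescript
// Illustrative early-exit token-limit check: consume tokens lazily from
// a generator and bail out as soon as the count exceeds the limit,
// instead of materializing the full encoding first.
// `fakeEncodeGenerator` is a stand-in that yields one "token" per
// whitespace-separated word, purely for demonstration.
function* fakeEncodeGenerator(text: string): Generator<number> {
  for (const word of text.split(/\s+/).filter(Boolean)) {
    yield word.length; // fake token id
  }
}

function withinTokenLimit(text: string, limit: number): number | false {
  let count = 0;
  for (const _token of fakeEncodeGenerator(text)) {
    count++;
    if (count > limit) return false; // stop early: limit already exceeded
  }
  return count; // within limit: return the token count
}

console.log(withinTokenLimit("one two three", 5)); // 3
console.log(withinTokenLimit("one two three", 2)); // false
```

Returning the count on success and false on failure lets callers use one call for both validation and measurement; on long inputs the early return avoids encoding the tail at all.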

Maintenance & Community

  • Trusted by major organizations like Microsoft (Teams) and Elastic (Kibana).
  • Contributions are welcome via pull requests or issues.
  • Discussions are available for ideas and inquiries.

Licensing & Compatibility

  • MIT License.
  • Fully compatible with commercial and closed-source applications.

Limitations & Caveats

  • By default, all special tokens are disallowed during encoding; inputs containing them must be explicitly permitted via EncodeOptions.
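The disallow-by-default behavior can be sketched as a pre-encode guard. The token strings and option shape below are hypothetical illustrations, not gpt-tokenizer's actual EncodeOptions:

```typescript
// Hypothetical pre-encode guard: reject input containing special-token
// literals unless the caller explicitly allows them. The special-token
// strings and the `allowed` parameter are illustrative assumptions.
const SPECIAL_TOKENS = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"];

function assertSpecialTokensAllowed(
  text: string,
  allowed: Set<string> = new Set(), // disallow everything by default
): void {
  for (const tok of SPECIAL_TOKENS) {
    if (text.includes(tok) && !allowed.has(tok)) {
      throw new Error(`Disallowed special token in input: ${tok}`);
    }
  }
}

assertSpecialTokensAllowed("plain text"); // ok
assertSpecialTokensAllowed("<|endoftext|>", new Set(["<|endoftext|>"])); // ok
// assertSpecialTokensAllowed("<|endoftext|>"); // would throw
```

Throwing on unexpected special tokens prevents untrusted input from injecting control tokens into a prompt, which is why encoders default to disallowing them.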
Health Check
Last Commit

1 month ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
1
Star History
15 stars in the last 30 days

Explore Similar Projects

Starred by Kaichao You (Core Maintainer of vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

lm-format-enforcer by noamgat

0.6%
2k
Format enforcer for language model outputs (JSON, regex, etc.)
Created 2 years ago
Updated 3 weeks ago
Starred by Simon Willison (Co-creator of Django), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 2 more.

GPT-3-Encoder by latitudegames

0%
721
JS library for GPT-2/GPT-3 text tokenization
Created 5 years ago
Updated 2 years ago