gpt-tokenizer by niieani

JS library for OpenAI GPT model token encoding/decoding

Created 2 years ago
620 stars

Top 53.2% on SourcePulse

Project Summary

This library provides the fastest JavaScript Byte Pair Encoding (BPE) tokenizer for OpenAI's GPT models, designed for developers working with LLMs in JavaScript environments. It offers a performant, low-footprint solution for encoding and decoding text to and from tokens, supporting all current OpenAI models and offering advanced features like chat tokenization and asynchronous streaming.

How It Works

The library is a direct port of OpenAI's tiktoken library, implemented in TypeScript for type safety and performance. It utilizes BPE algorithms to convert text into integer token sequences, mirroring OpenAI's official tokenization. Key advantages include synchronous operation, generator functions for streaming, efficient isWithinTokenLimit checks, and a memory-efficient design without global caches.
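The BPE merge loop at the heart of such tokenizers can be sketched as follows. This is an illustrative toy with a made-up vocabulary and merge ranks, not gpt-tokenizer's actual tables or API:

```typescript
// Toy BPE encoder: repeatedly merge the adjacent pair with the best
// (lowest) merge rank until no mergeable pair remains, then map the
// resulting symbols to integer token ids.
// The ranks and vocabulary below are invented for illustration only.
const ranks = new Map<string, number>([
  ["l+o", 0],
  ["lo+w", 1],
]);
const vocab = new Map<string, number>([
  ["low", 256],
  ["e", 101],
  ["r", 114],
]);

function bpeEncode(word: string): number[] {
  // Start from individual characters.
  let parts = [...word];
  for (;;) {
    // Find the adjacent pair with the lowest merge rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(parts[i] + "+" + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best === -1) break; // no more merges apply
    // Merge the winning pair into one symbol.
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  // Map each final symbol to its token id (toy fallback: -1 for unknown).
  return parts.map((p) => vocab.get(p) ?? -1);
}

console.log(bpeEncode("lower")); // [256, 101, 114]
```

The real encoder works over byte sequences and a full rank table, but the merge-until-stable loop is the same idea.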

Quick Start & Requirements

  • Install via npm: npm install gpt-tokenizer
  • Usage examples and detailed API documentation are available in the README.
  • A live playground is accessible at: https://gpt-tokenizer.dev/

Highlighted Details

  • Supports all OpenAI models, including GPT-4o, GPT-4, and GPT-3.5-turbo, with various encodings (o200k_base, cl100k_base, etc.).
  • Offers encodeChat for efficient chat message tokenization.
  • Provides asynchronous generator functions (decodeAsyncGenerator) for streaming token processing.
  • Includes an isWithinTokenLimit function for quick token count checks without full encoding.
  • Benchmarked as the fastest tokenizer on NPM, outperforming WASM/node bindings.
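The early-exit check that isWithinTokenLimit describes can be sketched like this. It is a hypothetical standalone illustration layered on a stand-in tokenizer generator, not the library's internals:

```typescript
// Illustrative early-exit token-limit check: consume tokens lazily from
// a generator and bail out as soon as the count exceeds the limit,
// instead of materializing the full encoding first.
// `fakeEncodeGenerator` is a stand-in that yields one "token" per
// whitespace-separated word, purely for demonstration.
function* fakeEncodeGenerator(text: string): Generator<number> {
  for (const word of text.split(/\s+/).filter(Boolean)) {
    yield word.length; // fake token id
  }
}

function withinTokenLimit(text: string, limit: number): number | false {
  let count = 0;
  for (const _token of fakeEncodeGenerator(text)) {
    count++;
    if (count > limit) return false; // stop early: limit already exceeded
  }
  return count; // within limit: return the token count
}

console.log(withinTokenLimit("one two three", 5)); // 3
console.log(withinTokenLimit("one two three", 2)); // false
```

Returning the count on success and false on failure lets callers use one call for both validation and measurement; on long inputs the early return avoids encoding the tail at all.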

Maintenance & Community

  • Trusted by major organizations like Microsoft (Teams) and Elastic (Kibana).
  • Contributions are welcome via pull requests or issues.
  • Discussions are available for ideas and inquiries.

Licensing & Compatibility

  • MIT License.
  • Fully compatible with commercial and closed-source applications.

Limitations & Caveats

  • By default, all special tokens are disallowed during encoding; inputs containing them must be explicitly permitted via EncodeOptions.
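The disallow-by-default behavior can be sketched as a pre-encode guard. The token strings and option shape below are hypothetical illustrations, not gpt-tokenizer's actual EncodeOptions:

```typescript
// Hypothetical pre-encode guard: reject input containing special-token
// literals unless the caller explicitly allows them. The special-token
// strings and the `allowed` parameter are illustrative assumptions.
const SPECIAL_TOKENS = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"];

function assertSpecialTokensAllowed(
  text: string,
  allowed: Set<string> = new Set(), // disallow everything by default
): void {
  for (const tok of SPECIAL_TOKENS) {
    if (text.includes(tok) && !allowed.has(tok)) {
      throw new Error(`Disallowed special token in input: ${tok}`);
    }
  }
}

assertSpecialTokensAllowed("plain text"); // ok
assertSpecialTokensAllowed("<|endoftext|>", new Set(["<|endoftext|>"])); // ok
// assertSpecialTokensAllowed("<|endoftext|>"); // would throw
```

Throwing on unexpected special tokens prevents untrusted input from injecting control tokens into a prompt, which is why encoders default to disallowing them.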
Health Check
Last Commit

1 month ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
1
Star History
15 stars in the last 30 days

Explore Similar Projects

Starred by Kaichao You (Core Maintainer of vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

lm-format-enforcer by noamgat

0.6%
2k
Format enforcer for language model outputs (JSON, regex, etc.)
Created 2 years ago
Updated 3 weeks ago
Starred by Simon Willison (Co-creator of Django), Jared Palmer (Ex-VP AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 2 more.

GPT-3-Encoder by latitudegames

0%
721
JS library for GPT-2/GPT-3 text tokenization
Created 5 years ago
Updated 2 years ago