jtokkit by knuddelsgmbh

Java tokenizer library for OpenAI models

created 2 years ago
676 stars

Top 51.0% on sourcepulse

Project Summary

JTokkit is a Java tokenizer library for OpenAI models, offering a JVM-compatible alternative to Python's tiktoken. It enables developers to efficiently count tokens for API requests and manage text processing within Java applications, benefiting those working with large language models in the Java ecosystem.

How It Works

JTokkit implements byte-pair encoding (BPE) algorithms, supporting multiple OpenAI encoding types including cl100k_base (used by GPT-4 and GPT-3.5-turbo) and r50k_base (used by older GPT-3 models). It provides a thread-safe EncodingRegistry to manage different encodings and an Encoding interface for tokenizing and detokenizing text. The library is designed for speed and efficiency, aiming to match or exceed the performance of tiktoken.
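The registry-and-encoding flow described above can be sketched as follows. This is a minimal example based on the library's documented API (class names `Encodings`, `EncodingRegistry`, `Encoding`, and `EncodingType` are from JTokkit; the exact return type of `encode` varies between library versions, so `var` is used for the token list):

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenizeExample {
    public static void main(String[] args) {
        // Thread-safe registry managing all supported encodings
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo
        Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

        String text = "Hello, world!";

        // Count tokens directly, without materializing the token list
        int count = enc.countTokens(text);

        // Encode to token ids and decode back; the round trip preserves the text
        var tokens = enc.encode(text);
        String roundTrip = enc.decode(tokens);

        System.out.println("tokens: " + count);
        System.out.println("round trip ok: " + roundTrip.equals(text));
    }
}
```

Because the registry is thread-safe, a single instance can be created at application startup and shared across threads.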

Quick Start & Requirements
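The source page does not reproduce the README's quick-start steps. Assuming the library's Maven Central coordinates `com.knuddels:jtokkit` (check Maven Central for the current version), a minimal token-counting program that resolves the encoding by model rather than by encoding type looks like this:

```java
// Gradle dependency (version is a placeholder — check Maven Central):
//   implementation "com.knuddels:jtokkit:<latest>"
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class QuickStart {
    public static void main(String[] args) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // Resolve the correct encoding for a model name instead of
        // hard-coding an EncodingType
        Encoding enc = registry.getEncodingForModel(ModelType.GPT_4);

        int tokens = enc.countTokens("How many tokens is this?");
        System.out.println("token count: " + tokens);
    }
}
```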

Highlighted Details

  • 2-3 times faster than comparable tokenizers, as per internal benchmarks.
  • Supports five OpenAI encoding types: r50k_base, p50k_base, p50k_edit, cl100k_base, and o200k_base.
  • Zero external dependencies, simplifying integration.
  • Extensible API for custom encoding algorithms or BPE parameters.

Maintenance & Community

The project is maintained by knuddelsgmbh. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The library focuses solely on tokenization for OpenAI models and does not include other NLP functionality. The performance claims are plausible, but the detailed benchmark results live in a separate directory of the repository rather than in the README.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 27 stars in the last 90 days

