Java tokenizer library for OpenAI models
JTokkit is a Java tokenizer library for OpenAI models, offering a JVM-compatible alternative to Python's tiktoken. It enables developers to efficiently count tokens for API requests and manage text processing within Java applications, benefiting those working with large language models in the Java ecosystem.
How It Works
JTokkit implements byte-pair encoding (BPE) algorithms, supporting multiple OpenAI encoding types including cl100k_base (used by GPT-4 and GPT-3.5-turbo) and r50k_base (used by older GPT-3 models). It provides a thread-safe EncodingRegistry to manage different encodings and an Encoding interface for tokenizing and detokenizing text. The library is designed for speed and efficiency, aiming to match or exceed the performance of tiktoken.
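To make the byte-pair encoding idea concrete, here is a minimal sketch of the merge loop. It is not JTokkit's implementation: the class name ToyBpe and its eight-entry vocabulary are hypothetical, and it works on characters for readability, whereas real encodings such as cl100k_base operate on bytes, pre-split the text, and use rank tables with many thousands of entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of BPE merging. Hypothetical class, not part of JTokkit.
class ToyBpe {
    // Hypothetical vocabulary: piece -> rank. A lower rank merges first,
    // and the rank doubles as the token id.
    static final Map<String, Integer> RANKS = new HashMap<>();
    static {
        String[] pieces = {"l", "o", "w", "e", "r", "lo", "low", "er"};
        for (int i = 0; i < pieces.length; i++) {
            RANKS.put(pieces[i], i);
        }
    }

    static List<Integer> encode(String text) {
        // Start from single characters (real BPE starts from single bytes).
        List<String> parts = new ArrayList<>();
        for (char c : text.toCharArray()) {
            String piece = String.valueOf(c);
            if (!RANKS.containsKey(piece)) {
                throw new IllegalArgumentException("Not in toy vocabulary: " + piece);
            }
            parts.add(piece);
        }

        // Repeatedly merge the adjacent pair with the lowest rank.
        while (parts.size() > 1) {
            int bestIndex = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = RANKS.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) {
                break; // no adjacent pair is in the vocabulary
            }
            parts.set(bestIndex, parts.get(bestIndex) + parts.get(bestIndex + 1));
            parts.remove(bestIndex + 1);
        }

        // Map each remaining piece to its token id.
        List<Integer> tokens = new ArrayList<>();
        for (String piece : parts) {
            tokens.add(RANKS.get(piece));
        }
        return tokens;
    }

    static String decode(List<Integer> tokens) {
        // Invert the vocabulary and concatenate the pieces.
        String[] byId = new String[RANKS.size()];
        RANKS.forEach((piece, id) -> byId[id] = piece);
        StringBuilder sb = new StringBuilder();
        for (int id : tokens) {
            sb.append(byId[id]);
        }
        return sb.toString();
    }
}
```

With this toy vocabulary, "lower" is first split into five characters, then merged step by step into the pieces "low" and "er", so encoding yields two token ids and decoding them restores the original string. A production tokenizer follows the same merge-by-rank principle but with far more attention to speed.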
Quick Start & Requirements
Add the Maven dependency com.knuddels:jtokkit:1.1.0 or the Gradle equivalent.
Highlighted Details
Supported encodings: r50k_base, p50k_base, p50k_edit, cl100k_base, and o200k_base.
Maintenance & Community
The project is maintained by knuddelsgmbh. No specific community channels or roadmap links are provided in the README.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The library focuses solely on tokenization for OpenAI models and does not include other NLP functionality. Performance is claimed to be high, but no figures are given here; benchmark details are located in a separate directory.