jtokkit by knuddelsgmbh

Java tokenizer library for OpenAI models

created 2 years ago
676 stars

Top 51.0% on sourcepulse

Project Summary

JTokkit is a Java tokenizer library for OpenAI models, offering a JVM-compatible alternative to Python's tiktoken. It enables developers to efficiently count tokens for API requests and manage text processing within Java applications, benefiting those working with large language models in the Java ecosystem.

How It Works

JTokkit implements byte-pair encoding (BPE) algorithms, supporting multiple OpenAI encoding types including cl100k_base (used by GPT-4 and GPT-3.5-turbo) and r50k_base (used by older GPT-3 models). It provides a thread-safe EncodingRegistry to manage different encodings and an Encoding interface for tokenizing and detokenizing text. The library is designed for speed and efficiency, aiming to match or exceed the performance of tiktoken.
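The registry-and-encoding flow described above can be sketched as follows. This is a minimal example based on the library's documented API (class names `Encodings`, `EncodingRegistry`, `Encoding`, and `EncodingType` are from JTokkit; the exact return type of `encode` varies between library versions, so `var` is used for the token list):

```java
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenizeExample {
    public static void main(String[] args) {
        // Thread-safe registry managing all supported encodings
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo
        Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

        String text = "Hello, world!";

        // Count tokens directly, without materializing the token list
        int count = enc.countTokens(text);

        // Encode to token ids and decode back; the round trip preserves the text
        var tokens = enc.encode(text);
        String roundTrip = enc.decode(tokens);

        System.out.println("tokens: " + count);
        System.out.println("round trip ok: " + roundTrip.equals(text));
    }
}
```

Because the registry is thread-safe, a single instance can be created at application startup and shared across threads.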

Quick Start & Requirements
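The source page does not reproduce the README's quick-start steps. Assuming the library's Maven Central coordinates `com.knuddels:jtokkit` (check Maven Central for the current version), a minimal token-counting program that resolves the encoding by model rather than by encoding type looks like this:

```java
// Gradle dependency (version is a placeholder — check Maven Central):
//   implementation "com.knuddels:jtokkit:<latest>"
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

public class QuickStart {
    public static void main(String[] args) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // Resolve the correct encoding for a model name instead of
        // hard-coding an EncodingType
        Encoding enc = registry.getEncodingForModel(ModelType.GPT_4);

        int tokens = enc.countTokens("How many tokens is this?");
        System.out.println("token count: " + tokens);
    }
}
```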

Highlighted Details

  • 2-3 times faster than comparable tokenizers, as per internal benchmarks.
  • Supports five OpenAI encoding types: r50k_base, p50k_base, p50k_edit, cl100k_base, and o200k_base.
  • Zero external dependencies, simplifying integration.
  • Extensible API for custom encoding algorithms or BPE parameters.

Maintenance & Community

The project is maintained by knuddelsgmbh. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The library focuses solely on tokenization for OpenAI models and does not include other NLP functionality. The performance claims are plausible, but the detailed benchmark results live in a separate directory of the repository rather than in the README.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 27 stars in the last 90 days

