jtokkit by knuddelsgmbh

Java tokenizer library for OpenAI models

Created 2 years ago
687 stars

Top 49.5% on SourcePulse

Project Summary

JTokkit is a Java tokenizer library for OpenAI models, offering a JVM-compatible alternative to Python's tiktoken. It enables developers to efficiently count tokens for API requests and manage text processing within Java applications, benefiting those working with large language models in the Java ecosystem.

How It Works

JTokkit implements byte-pair encoding (BPE) algorithms, supporting multiple OpenAI encoding types including cl100k_base (used by GPT-4 and GPT-3.5-turbo) and r50k_base (used by older GPT-3 models). It provides a thread-safe EncodingRegistry to manage different encodings and an Encoding interface for tokenizing and detokenizing text. The library is designed for speed and efficiency, aiming to match or exceed the performance of tiktoken.
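The core BPE idea described above can be illustrated with a self-contained toy sketch. This is not JTokkit's implementation, and the merge table below is a hypothetical example; it only shows the general mechanism of repeatedly applying the lowest-ranked adjacent merge:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BpeSketch {
    // Hypothetical ranked merge rules: lower rank = applied earlier.
    static final Map<String, Integer> RANKS = Map.of(
            "l l", 0,
            "ll o", 1,
            "h e", 2,
            "he llo", 3
    );

    static List<String> encode(String word) {
        // Start from individual characters, as byte-level BPE starts from bytes.
        List<String> parts = new ArrayList<>();
        for (char c : word.toCharArray()) parts.add(String.valueOf(c));

        while (true) {
            // Find the adjacent pair with the lowest (highest-priority) rank.
            int bestRank = Integer.MAX_VALUE, bestIdx = -1;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = RANKS.get(parts.get(i) + " " + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    bestRank = rank;
                    bestIdx = i;
                }
            }
            if (bestIdx < 0) break; // no applicable merge left

            // Merge the pair into a single subword unit.
            parts.set(bestIdx, parts.get(bestIdx) + parts.get(bestIdx + 1));
            parts.remove(bestIdx + 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(encode("hello")); // merges h,e,l,l,o into [hello]
    }
}
```

With this table, "hello" merges as h e l l o → h e ll o → h e llo → he llo → hello. Real encodings such as cl100k_base work the same way but over byte sequences with roughly 100,000 ranked merges.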

Quick Start & Requirements
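JTokkit requires only a JDK and the library itself (it has no external dependencies). A minimal usage sketch, assuming the `com.knuddels:jtokkit` artifact is on the classpath; exact API details may vary between versions, so check the project README:

```java
// Assumes the com.knuddels:jtokkit dependency is on the classpath.
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class TokenCount {
    public static void main(String[] args) {
        // The registry is thread-safe and can be shared across the application.
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
        Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

        String prompt = "Hello, world!";
        // countTokens is useful for budgeting API requests without
        // materializing the full token list.
        int tokens = enc.countTokens(prompt);
        System.out.println(prompt + " -> " + tokens + " tokens");

        // Round-trip: decoding the encoded tokens reproduces the input.
        String roundTrip = enc.decode(enc.encode(prompt));
        System.out.println(roundTrip.equals(prompt)); // expect true
    }
}
```

The registry can also resolve an encoding from a model name where supported, so callers do not need to hard-code which encoding a given OpenAI model uses.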

Highlighted Details

  • 2-3 times faster than comparable tokenizers, as per internal benchmarks.
  • Supports five OpenAI encoding types: r50k_base, p50k_base, p50k_edit, cl100k_base, and o200k_base.
  • Zero external dependencies, simplifying integration.
  • Extensible API for custom encoding algorithms or BPE parameters.

Maintenance & Community

The project is maintained by knuddelsgmbh. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The library focuses solely on tokenization for OpenAI models and does not include broader NLP functionality. Performance claims come from the project's own benchmarks, which are kept in a separate benchmark directory of the repository rather than the README.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 5
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
