Go port of OpenAI's tiktoken tokenizer
This library provides a pure Go implementation of OpenAI's tiktoken tokenizer, enabling efficient text encoding and decoding for large language models within Go applications. It targets developers who need to integrate LLM tokenization directly into their Go services without external dependencies or a Python runtime.
How It Works
The library embeds OpenAI's vocabulary data directly in Go maps that are compiled into the binary at build time. This avoids runtime downloads and caching, which can mean faster startup and better performance than Python implementations that load vocabulary files at runtime. It supports the multiple encoding types used by OpenAI models.
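The exact embedding mechanism is internal to the library (it may generate Go source rather than use go:embed), but the general build-time-embedding technique looks roughly like the sketch below; the file name, variable names, and parsing are illustrative assumptions, not the library's actual code.

```go
package vocab

import (
	_ "embed"
	"encoding/base64"
	"strconv"
	"strings"
)

// The vocabulary file is compiled into the binary at build time, so nothing
// is downloaded or read from disk at runtime. The file name and parsing here
// are illustrative assumptions, not the library's actual source.
//
//go:embed cl100k_base.tiktoken
var cl100kRaw string

// Ranks maps each token (raw bytes stored in a string) to its token id.
var Ranks = buildRanks(cl100kRaw)

func buildRanks(raw string) map[string]uint {
	m := make(map[string]uint)
	for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
		fields := strings.Fields(line) // each line: "<base64 token> <rank>"
		if len(fields) != 2 {
			continue
		}
		tok, err := base64.StdEncoding.DecodeString(fields[0])
		if err != nil {
			continue
		}
		rank, err := strconv.ParseUint(fields[1], 10, 32)
		if err != nil {
			continue
		}
		m[string(tok)] = uint(rank)
	}
	return m
}
```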
Quick Start & Requirements
Install the package with go get github.com/tiktoken-go/tokenizer, import github.com/tiktoken-go/tokenizer, and call tokenizer.Get() with the desired encoding (e.g., tokenizer.Cl100kBase). A CLI tool is also included for direct use.
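A minimal usage sketch based on the API described above; tokenizer.Get and tokenizer.Cl100kBase come from the text, while the Encode/Decode return values shown here are assumptions about the returned codec, not guarantees.

```go
package main

import (
	"fmt"
	"log"

	"github.com/tiktoken-go/tokenizer"
)

func main() {
	// Get a codec for the cl100k_base encoding.
	enc, err := tokenizer.Get(tokenizer.Cl100kBase)
	if err != nil {
		log.Fatal(err)
	}

	// Encode is assumed to return token ids, token strings, and an error.
	ids, _, err := enc.Encode("hello, tokenizer")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(ids)

	// Decode round-trips the ids back to the original text.
	text, err := enc.Decode(ids)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```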
Highlighted Details
Supports the cl100k_base, o200k_base, r50k_base, p50k_base, and p50k_edit encodings.
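One common use of these encodings is counting tokens in a prompt before calling a model; a short sketch, where countTokens is a hypothetical helper and tokenizer.Codec is an assumed name for the type returned by Get, so check the package's godoc for the actual identifiers.

```go
package main

import (
	"fmt"
	"log"

	"github.com/tiktoken-go/tokenizer"
)

// countTokens is a hypothetical helper, not part of the library.
// tokenizer.Codec is an assumed name for the value returned by Get.
func countTokens(enc tokenizer.Codec, text string) int {
	ids, _, err := enc.Encode(text)
	if err != nil {
		log.Fatal(err)
	}
	return len(ids)
}

func main() {
	enc, err := tokenizer.Get(tokenizer.Cl100kBase)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(countTokens(enc, "How many tokens is this prompt?"))
}
```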
Maintenance & Community
The project appears to be actively maintained, with a clear list of completed and pending tasks in the README. No specific community channels or external contributors are highlighted.
Licensing & Compatibility
The README does not explicitly state a license. Given it's a port of OpenAI's tokenizer, users should verify licensing implications, especially for commercial use.
Limitations & Caveats
The library embeds ~4MB of vocabulary data directly into the Go binary. Handling of special tokens and the gpt-2 model encoding are listed as pending.