tokenizer by tiktoken-go

Go port of OpenAI's tiktoken tokenizer

Created 2 years ago

411 stars

Top 71.1% on SourcePulse

Project Summary

This Go library provides a pure Go implementation of OpenAI's tiktoken tokenizer, enabling efficient text encoding and decoding for large language models within Go applications. It targets developers needing to integrate LLM tokenization capabilities directly into their Go services without external dependencies or Python runtimes.

How It Works

The library embeds OpenAI's vocabulary data as Go maps that are compiled into the binary at build time. This avoids downloading and caching vocabulary files at runtime, which can mean faster startup than implementations that load the vocabulary from external files. It supports multiple encoding types used by OpenAI models.
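
To make the embedding idea concrete, here is a minimal, self-contained sketch. It is not the library's code: the vocabulary entries, the IDs, and the helper names (vocab, reverse, encode, decode) are made up for illustration, and the lookup is a simple greedy longest-match rather than the byte-pair-encoding merge procedure that tiktoken uses. It only shows how a vocabulary written as a Go map literal ships inside the binary and needs no file or network access at startup.

```go
package main

import (
	"fmt"
	"strings"
)

// vocab is a toy vocabulary written as a map literal, so it is compiled
// into the binary. The tokens and IDs are invented for this example.
var vocab = map[string]uint{
	"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4,
}

// maxTokenLen is the byte length of the longest key in vocab.
const maxTokenLen = 3

// reverse maps IDs back to token strings. It is built once during
// package initialization, before main runs.
var reverse = func() map[uint]string {
	m := make(map[uint]string, len(vocab))
	for tok, id := range vocab {
		m[id] = tok
	}
	return m
}()

// encode splits text into tokens by greedy longest-match against vocab.
// This is a simplification, not BPE.
func encode(text string) ([]uint, error) {
	var ids []uint
	for i := 0; i < len(text); {
		n := maxTokenLen
		if rest := len(text) - i; rest < n {
			n = rest
		}
		matched := false
		for ; n > 0; n-- {
			if id, ok := vocab[text[i:i+n]]; ok {
				ids = append(ids, id)
				i += n
				matched = true
				break
			}
		}
		if !matched {
			return nil, fmt.Errorf("no token for input at byte %d", i)
		}
	}
	return ids, nil
}

// decode concatenates the token strings for the given IDs.
func decode(ids []uint) (string, error) {
	var sb strings.Builder
	for _, id := range ids {
		tok, ok := reverse[id]
		if !ok {
			return "", fmt.Errorf("unknown token id %d", id)
		}
		sb.WriteString(tok)
	}
	return sb.String(), nil
}

func main() {
	ids, err := encode("abcab")
	if err != nil {
		panic(err)
	}
	text, err := decode(ids)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids, text) // prints: [4 3] abcab
}
```

The trade-off of this design is that the table lives in the executable itself, so the binary grows with the vocabulary in exchange for having no external files to locate or fetch.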

Quick Start & Requirements

Install: go get github.com/tiktoken-go/tokenizer
Requirements: Go toolchain.
Usage: Import github.com/tiktoken-go/tokenizer and call tokenizer.Get() with the desired encoding (e.g., tokenizer.Cl100kBase) to obtain an encoder. A CLI tool is also included for direct use.

Highlighted Details

Pure Go implementation, no Python dependency.
Embeds vocabularies in the binary, so no vocabulary files are downloaded or loaded at runtime.
Supports cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit encodings.
Includes a command-line interface for direct tokenization.

Maintenance & Community

The project appears to be actively maintained, with a clear list of completed and pending tasks in the README. No specific community channels or external contributors are highlighted.

Licensing & Compatibility

The README does not explicitly state a license. Given it's a port of OpenAI's tokenizer, users should verify licensing implications, especially for commercial use.

Limitations & Caveats

The library embeds ~4MB of vocabulary data directly into the Go binary, which increases the size of any program that imports it. Handling of special tokens and the gpt-2 model encoding are listed as pending.

Explore Similar Projects

gtt by eeeXun

Translate-It by iSegaro

text2text by artitw

json-translate by ViggoZ

jtokkit by knuddelsgmbh

gpt-tokenizer by niieani

json-translator by mololab

gitkraken-chinese by yk47g

ebook-GPT-translator by jesselau76

LunaTranslator by HIllya51

tiktoken by openai

LibreTranslate by LibreTranslate