tokenizer by tiktoken-go

Go port of OpenAI's tiktoken tokenizer

created 2 years ago
378 stars

Top 76.3% on sourcepulse

Project Summary

This Go library provides a pure Go implementation of OpenAI's tiktoken tokenizer, enabling efficient text encoding and decoding for large language models within Go applications. It targets developers needing to integrate LLM tokenization capabilities directly into their Go services without external dependencies or Python runtimes.

How It Works

The library directly embeds OpenAI's vocabulary data within Go maps, compiled during the build process. This approach avoids runtime downloads and caching, leading to potentially better performance and faster startup times compared to Python implementations that rely on external file loading. It supports multiple encoding types used by OpenAI models.
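
The idea of a vocabulary compiled into the binary can be sketched with a toy example. The code below is not taken from the library: the names toyVocab, toyTokens, encode, and decode are hypothetical, the nine-entry vocabulary is made up, and the merge loop is a simplified rank-based byte-pair merge without the regex pre-splitting that real tiktoken encodings use. It only illustrates that a map literal needs no file or network access at startup.

```go
package main

import (
	"fmt"
	"strings"
)

// toyVocab is a hypothetical, hand-written vocabulary (token -> rank).
// As a map literal it is compiled into the binary, so nothing is read
// from disk or downloaded at startup.
var toyVocab = map[string]uint{
	"h": 0, "e": 1, "l": 2, "o": 3, " ": 4,
	"he": 5, "ll": 6, "hell": 7, "hello": 8,
}

// toyTokens is the reverse lookup (rank -> token), derived once at startup.
var toyTokens = func() map[uint]string {
	m := make(map[uint]string, len(toyVocab))
	for tok, id := range toyVocab {
		m[id] = tok
	}
	return m
}()

// encode splits text into single bytes, then repeatedly merges the adjacent
// pair whose concatenation has the lowest rank, until no pair can be merged.
func encode(text string) ([]uint, error) {
	parts := make([]string, 0, len(text))
	for i := 0; i < len(text); i++ {
		parts = append(parts, text[i:i+1])
	}
	for {
		best, bestRank, found := 0, uint(0), false
		for i := 0; i+1 < len(parts); i++ {
			if r, ok := toyVocab[parts[i]+parts[i+1]]; ok && (!found || r < bestRank) {
				best, bestRank, found = i, r, true
			}
		}
		if !found {
			break
		}
		parts[best] += parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	ids := make([]uint, 0, len(parts))
	for _, p := range parts {
		id, ok := toyVocab[p]
		if !ok {
			return nil, fmt.Errorf("byte sequence %q not in toy vocabulary", p)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// decode concatenates the token string for each id.
func decode(ids []uint) (string, error) {
	var sb strings.Builder
	for _, id := range ids {
		tok, ok := toyTokens[id]
		if !ok {
			return "", fmt.Errorf("unknown token id %d", id)
		}
		sb.WriteString(tok)
	}
	return sb.String(), nil
}

func main() {
	ids, err := encode("hello hell")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // prints [8 4 7]
	text, _ := decode(ids)
	fmt.Println(text) // prints hello hell
}
```

The cost of this design is binary size: every entry in the map literal ships inside the executable, which is the trade-off noted under Limitations & Caveats.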

Quick Start & Requirements

  • Install: go get github.com/tiktoken-go/tokenizer
  • Requirements: Go toolchain.
  • Usage: Import github.com/tiktoken-go/tokenizer and call tokenizer.Get() with the desired encoding (e.g., tokenizer.Cl100kBase) to obtain an encoder. A CLI tool is also included for direct use.

Highlighted Details

  • Pure Go implementation, no Python dependency.
  • Embeds vocabularies for faster startup and runtime.
  • Supports cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit encodings.
  • Includes a command-line interface for direct tokenization.

Maintenance & Community

The project appears to be actively maintained, with a clear list of completed and pending tasks in the README. No specific community channels or external contributors are highlighted.

Licensing & Compatibility

The README does not explicitly state a license. Given it's a port of OpenAI's tokenizer, users should verify licensing implications, especially for commercial use.

Limitations & Caveats

The library embeds ~4MB of vocabulary data directly into the Go binary. Handling of special tokens and the gpt-2 model encoding are listed as pending.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

32 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

tokenmonster by alasdairforsythe

0.7%
594
Subword tokenizer and vocabulary trainer for multiple languages
created 2 years ago
updated 1 year ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

0.2%
10k
Minimal BPE encoder/decoder for LLM tokenization
created 1 year ago
updated 1 year ago