ttok by simonw

CLI tool for counting and truncating text based on tokens

Created 2 years ago

380 stars

Top 75.1% on SourcePulse

1 Expert Loves This Project

Ed Huang

Cofounder of PingCAP

Project Summary

ttok is a command-line utility that counts the tokens in a piece of text and truncates text to a given number of tokens, intended for use with Large Language Models (LLMs). It is built on OpenAI's tiktoken library, which makes it useful for developers and researchers working with LLM APIs that price by token or enforce a context window limit.

How It Works

The tool uses the tiktoken library to encode text into integer token IDs, the same representation that OpenAI models consume. Different models use different tokenizers, so the -m flag selects the model whose encoding is applied. The core functionality is counting the tokens in text passed as arguments or piped in, and truncating text to a specified token limit with the -t flag.
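The count-and-truncate idea can be sketched in a few lines of Python. This is an illustrative sketch, not ttok's actual source: the function names `count_tokens` and `truncate_to_tokens` are hypothetical, and the default tokenizer is a stand-in that treats each UTF-8 byte as one token so the example runs without any dependencies. The trailing comment shows how tiktoken's real encoder could be swapped in.

```python
from typing import Callable, List

Encoder = Callable[[str], List[int]]
Decoder = Callable[[List[int]], str]


def byte_encode(text: str) -> List[int]:
    """Stand-in tokenizer (an assumption, not tiktoken): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(tokens: List[int]) -> str:
    """Inverse of byte_encode; silently drops an incomplete trailing character."""
    return bytes(tokens).decode("utf-8", errors="ignore")


def count_tokens(text: str, encode: Encoder = byte_encode) -> int:
    """Counting is simply the length of the encoded token list."""
    return len(encode(text))


def truncate_to_tokens(
    text: str,
    limit: int,
    encode: Encoder = byte_encode,
    decode: Decoder = byte_decode,
) -> str:
    """Truncating is encode, keep the first `limit` tokens, then decode."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return decode(encode(text)[:limit])


# With tiktoken installed, a real model encoding could be plugged in instead:
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4")
#   count_tokens("Hello world", enc.encode)
#   truncate_to_tokens("Hello world", 1, enc.encode, enc.decode)
```

Passing the encoder and decoder as arguments keeps the counting and truncation logic independent of any particular tokenizer, which loosely mirrors how the -m flag changes the encoding without changing the operation.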

Quick Start & Requirements

Install via pip: pip install ttok
Install via Homebrew: brew install simonw/llm/ttok
Requires Python.

Highlighted Details

Supports token counting and truncation for various LLM models (GPT-4, GPT-3.5, GPT-2, etc.).
Can display raw token IDs (--encode) and decode them back to text (--decode).
Allows detailed token breakdown (--tokens).
Can append tokens from arguments to piped input (-i -).
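To make the encode, decode, and token-breakdown ideas concrete, here is a small Python sketch of a round trip. It does not use tiktoken and is not how ttok is implemented: `ToyTokenizer` is a hypothetical word-level tokenizer that assigns IDs in order of first appearance, chosen only so the example is self-contained. Real token IDs and token boundaries from tiktoken will differ.

```python
import re
from typing import Dict, List, Tuple


class ToyTokenizer:
    """Hypothetical word-level tokenizer for illustration only (not tiktoken)."""

    def __init__(self) -> None:
        self._id_by_piece: Dict[str, int] = {}
        self._piece_by_id: List[str] = []

    def encode(self, text: str) -> List[int]:
        """Map text to integer IDs, growing the vocabulary as new pieces appear."""
        ids: List[int] = []
        # A piece is a word together with its leading whitespace, or a run of
        # trailing whitespace, so joining the pieces reproduces the text exactly.
        for piece in re.findall(r"\s*\S+|\s+", text):
            if piece not in self._id_by_piece:
                self._id_by_piece[piece] = len(self._piece_by_id)
                self._piece_by_id.append(piece)
            ids.append(self._id_by_piece[piece])
        return ids

    def decode(self, ids: List[int]) -> str:
        """Map integer IDs back to the text they were produced from."""
        return "".join(self._piece_by_id[i] for i in ids)

    def breakdown(self, text: str) -> List[Tuple[int, str]]:
        """Pair each token ID with the text fragment it stands for."""
        return [(i, self._piece_by_id[i]) for i in self.encode(text)]


tokenizer = ToyTokenizer()
ids = tokenizer.encode("Hello world Hello")
round_trip = tokenizer.decode(ids)
pairs = tokenizer.breakdown("Hello world")
```

Here `ids` plays the role of the raw token IDs, `round_trip` the decoded text, and `pairs` the per-token breakdown.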

Maintenance & Community

Developed by Simon Willison.
Source code available on GitHub: https://github.com/simonw/ttok

Licensing & Compatibility

Apache License 2.0.
Compatible with commercial use and closed-source applications.

Limitations & Caveats

Because the tool relies on the tiktoken library, its token counts are only as accurate as tiktoken's encodings, and support for a new model depends on tiktoken adding it. The README does not mention any unsupported platforms or known bugs.

Explore Similar Projects

rellm by r2d4

antislop-sampler by sam-paech

rust-tokenizers by guillaume-be

chatgpt-subtitle-translator by Cerlancism

gpt-tokenizer by niieani

lm-format-enforcer by noamgat

KoBART by SKT-AI

KoGPT2 by SKT-AI

GPT-3-Encoder by latitudegames

minbpe by karpathy

tokenizers by huggingface

tiktoken by openai