ttok by simonw

CLI tool for counting and truncating text based on tokens

created 2 years ago
365 stars

Top 78.2% on sourcepulse

Project Summary

ttok is a command-line utility that counts the tokens in a piece of text and truncates text to a token limit, primarily for use with Large Language Models (LLMs). It is built on OpenAI's tiktoken library, which makes it useful for developers and researchers working with LLM APIs that have token-based pricing or context window limits.

How It Works

The tool uses the tiktoken library to encode text into integer token IDs, the same representation OpenAI models consume as input. The -m flag selects the model, and with it the matching tokenizer, so counts reflect how that specific model would split the text. The core functionality is counting the tokens in text passed as arguments or piped in on standard input, and truncating text to a specified token limit with the -t flag.
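The count-and-truncate logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ttok's actual source: `ToyCodec`, `count_tokens`, and `truncate` are hypothetical names, and the whitespace tokenizer is a stand-in for tiktoken so that the example runs without extra dependencies. Real tiktoken tokens are byte-pair pieces rather than whole words, so real counts will differ.

```python
class ToyCodec:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word.

    ttok delegates this step to tiktoken; this class only mimics the
    encode/decode shape so the counting and truncation logic can be shown.
    """

    def __init__(self):
        self._ids = {}    # word -> token ID
        self._words = []  # token ID -> word

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids):
        return " ".join(self._words[i] for i in ids)


def count_tokens(text, codec):
    """Counting is simply the length of the encoded ID list."""
    return len(codec.encode(text))


def truncate(text, max_tokens, codec):
    """Encode, keep the first max_tokens IDs, and decode them back to text."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    return codec.decode(codec.encode(text)[:max_tokens])


codec = ToyCodec()
print(count_tokens("This is too many tokens", codec))  # 5 with this toy codec
print(truncate("This is too many tokens", 3, codec))   # This is too
```

Truncating on token IDs rather than on characters is the point of the design: the cut always lands on a token boundary, so the result fits the limit exactly as the model would count it.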

Quick Start & Requirements

  • Install via pip: pip install ttok
  • Install via Homebrew: brew install simonw/llm/ttok
  • Requires Python.

Highlighted Details

  • Supports token counting and truncation for various OpenAI models (GPT-4, GPT-3.5, GPT-2, etc.).
  • Can display raw token IDs (--encode) and decode them back to text (--decode).
  • Can show the individual tokens the text was split into (--tokens).
  • Can append text given as arguments to piped input (-i -).
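The round trip between text, token IDs, and per-token pieces can be illustrated with a small Python sketch. It assumes a byte-level vocabulary in which every UTF-8 byte is its own token; this is a deliberate simplification and not the vocabulary tiktoken uses, where byte pairs are merged into longer tokens. The function names are hypothetical and only loosely mirror the --encode, --decode, and --tokens options.

```python
def encode(text):
    """Text -> list of integer token IDs (here: one ID per UTF-8 byte)."""
    return list(text.encode("utf-8"))


def decode(ids):
    """List of integer token IDs -> text."""
    return bytes(ids).decode("utf-8")


def token_pieces(ids):
    """Token IDs -> the byte string each individual token stands for."""
    return [bytes([i]) for i in ids]


ids = encode("Hi")
print(ids)                # [72, 105]
print(decode(ids))        # Hi
print(token_pieces(ids))  # [b'H', b'i']
```

With a real BPE vocabulary the ID list is much shorter than the byte count, but the relationship is the same: decoding the IDs, or joining the per-token pieces, reproduces the original text.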

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The tool relies on the tiktoken library, so its counts are only as accurate as tiktoken's encodings, and it covers only the models tiktoken supports. The README mentions no specific limitations regarding unsupported platforms or known bugs.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 1
  • Issues (30d): 0

Star History

  • 23 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 1 more.

tokenmonster by alasdairforsythe

0.7%
594
Subword tokenizer and vocabulary trainer for multiple languages
created 2 years ago
updated 1 year ago
Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Travis Fischer (Founder of Agentic), and 1 more.

instructor-js by 567-labs

0%
738
TypeScript tool for structured extraction from LLMs
created 1 year ago
updated 6 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Andreas Jansson (Cofounder of Replicate), and 1 more.

lm-format-enforcer by noamgat

0.2%
2k
Format enforcer for language model outputs (JSON, regex, etc.)
created 1 year ago
updated 5 months ago
Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

minbpe by karpathy

0.2%
10k
Minimal BPE encoder/decoder for LLM tokenization
created 1 year ago
updated 1 year ago