ttok by simonw

CLI tool for counting and truncating text based on tokens

Created 2 years ago
374 stars

Top 75.8% on SourcePulse

Project Summary

ttok is a command-line utility that counts the tokens in a piece of text and truncates text to a token limit, primarily for use with Large Language Models (LLMs). It is built on OpenAI's tiktoken library, which makes it useful for developers and researchers working with LLM APIs that charge per token or enforce context window limits.

How It Works

The tool uses the tiktoken library to encode text into integer token IDs, the same representation OpenAI models consume as input. Different models use different encodings, so the -m flag selects the model whose tokenizer should be applied. By default ttok prints the number of tokens in the text given as arguments or piped in on standard input; with the -t flag it instead truncates the text to the specified number of tokens.
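
The counting and truncation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not ttok's actual source. The names count_tokens, truncate_tokens and ByteEncoding are hypothetical, and ByteEncoding is a stand-in that treats every UTF-8 byte as one token so the example runs without third-party packages. A tiktoken encoding object exposes the same encode and decode methods and could be passed in instead.

```python
from typing import List, Protocol


class Encoding(Protocol):
    """Anything with encode/decode, for example a tiktoken encoding object."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, tokens: List[int]) -> str: ...


class ByteEncoding:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte.

    This is NOT the byte-pair scheme tiktoken uses; it only keeps the
    sketch runnable with the standard library alone.
    """

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: List[int]) -> str:
        # errors="replace" guards against a cut through a multi-byte character.
        return bytes(tokens).decode("utf-8", errors="replace")


def count_tokens(text: str, encoding: Encoding) -> int:
    """Return how many tokens the text encodes to."""
    return len(encoding.encode(text))


def truncate_tokens(text: str, limit: int, encoding: Encoding) -> str:
    """Keep only the first `limit` tokens and decode them back to text."""
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:limit])


if __name__ == "__main__":
    enc = ByteEncoding()
    # Assumption: with tiktoken installed, one could swap in
    #   import tiktoken
    #   enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    print(count_tokens("Hello world", enc))        # 11 with the byte stand-in
    print(truncate_tokens("Hello world", 5, enc))  # Hello
```

Token counts from the byte stand-in will not match what ttok reports, since a real byte-pair encoding merges frequent byte sequences into single tokens. Only the encode, slice, decode shape of the operation carries over.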

Quick Start & Requirements

  • Install via pip: pip install ttok
  • Install via Homebrew: brew install simonw/llm/ttok
  • Requires Python.

Highlighted Details

  • Supports token counting and truncation for various OpenAI models (GPT-4, GPT-3.5, GPT-2, etc.).
  • Can display raw token IDs (--encode) and decode them back to text (--decode).
  • Can show the individual tokens rather than only their count (--tokens).
  • Can combine piped input with additional text passed as arguments (-i -).
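
To make the three views in the list above concrete (token IDs, decoded text, and the individual token pieces), here is a toy sketch. It is an illustration under stated assumptions rather than ttok's or tiktoken's behavior: ToyVocab is a hypothetical word-level tokenizer that assigns IDs on the fly, whereas real encodings use a fixed byte-pair vocabulary, so the IDs shown here are not the ones ttok would print.

```python
import re
from typing import Dict, List


class ToyVocab:
    """Hypothetical word-level tokenizer that grows its vocabulary on the fly.

    It only mimics the ID <-> bytes mapping of a real encoding so that the
    encode, decode and per-token views can be shown side by side.
    """

    def __init__(self) -> None:
        self._id_of: Dict[bytes, int] = {}
        self._piece_of: List[bytes] = []

    def encode(self, text: str) -> List[int]:
        """Map text to integer IDs, one per word (leading space kept)."""
        ids = []
        # Words keep one leading space; leftover whitespace runs are pieces too.
        for piece in re.findall(r" ?\S+|\s+", text):
            raw = piece.encode("utf-8")
            if raw not in self._id_of:
                self._id_of[raw] = len(self._piece_of)
                self._piece_of.append(raw)
            ids.append(self._id_of[raw])
        return ids

    def pieces(self, ids: List[int]) -> List[bytes]:
        """Return the byte string behind each token ID."""
        return [self._piece_of[i] for i in ids]

    def decode(self, ids: List[int]) -> str:
        """Join the token pieces back into the original text."""
        return b"".join(self.pieces(ids)).decode("utf-8")


if __name__ == "__main__":
    vocab = ToyVocab()
    ids = vocab.encode("Hello world")
    print(ids)                # [0, 1]
    print(vocab.pieces(ids))  # [b'Hello', b' world']
    print(vocab.decode(ids))  # Hello world
```

Because every character lands in exactly one piece, decoding the IDs reproduces the input exactly. That round-trip property is what makes it possible to turn token IDs back into text.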

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The tool relies on the tiktoken library, so its accuracy depends on that library's updates and on which models it supports. The README mentions no specific limitations such as unsupported platforms or known bugs.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 30 days
