toon by toon-format

Compact data format for LLMs

Created 8 months ago

24,824 stars

Top 2.0% on SourcePulse

11 Experts Love This Project

Chip Huyen: Author of "AI Engineering", "Designing Machine Learning Systems"
Hiroshi Shibata: Core Contributor to Ruby
Yaowei Zheng: Author of LLaMA-Factory
Anurag Goel: Founder of Render

and 7 more!

Project Summary

Token-Oriented Object Notation (TOON) is a compact, human-readable data format designed to reduce token usage when passing structured data to Large Language Models (LLMs). It targets developers and power users who frequently send large datasets to LLMs and want to lower costs. Compared with standard JSON, TOON typically reduces token count by 30-60%.

How It Works

TOON merges YAML's indentation-based structure for nested objects with CSV's tabular format for uniform data rows, optimizing for LLM contexts. It minimizes token overhead by removing redundant punctuation like braces, brackets, and most quotes, relying instead on whitespace and explicit declarations. Tabular arrays are a key feature, allowing keys to be declared once, followed by streamed rows without repetition, further enhancing token efficiency.
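The tabular idea described above can be sketched in a few lines of TypeScript. This is an illustration only, not the library's implementation: encodeTabular is a hypothetical helper, the name[N]{fields}: header shape is an assumption based on the description of declared lengths and field lists, and the sketch skips the quoting and escaping a real encoder needs.

```typescript
// Illustrative sketch only: NOT the toon library's implementation.
type Primitive = string | number | boolean | null;
type Row = Record<string, Primitive>;

// Render a uniform array of flat objects as one header line (array length plus
// field names, declared once) followed by one delimited line per row.
// Assumes every row has the same keys as the first row and that no value
// contains a comma or newline (a real encoder must quote or escape those).
function encodeTabular(name: string, rows: Row[]): string {
  const fields = Object.keys(rows[0] ?? {});
  const fieldList = fields.length > 0 ? `{${fields.join(",")}}` : "";
  const header = `${name}[${rows.length}]${fieldList}:`;
  const lines = rows.map(
    (row) => "  " + fields.map((f) => String(row[f])).join(",")
  );
  return [header, ...lines].join("\n");
}

const items: Row[] = [
  { sku: "A1", qty: 2, price: 9.99 },
  { sku: "B2", qty: 1, price: 14.5 },
];

const tabular = encodeTabular("items", items);
console.log(tabular);
// items[2]{sku,qty,price}:
//   A1,2,9.99
//   B2,1,14.5

// Character counts are only a rough proxy for tokens, but they show the effect
// of not repeating keys and punctuation on every row.
const jsonLength = JSON.stringify({ items }).length;
console.log(`JSON: ${jsonLength} chars, tabular: ${tabular.length} chars`);
```

The keys appear once in the header instead of once per row, which is where the savings come from; actual token counts depend on the tokenizer.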

Quick Start & Requirements

Primary install: npm install @byjohann/toon (pnpm and yarn also work).
Prerequisites: Node.js and a package manager (npm, pnpm, or yarn). No specific hardware, OS, or GPU requirements are mentioned.

Quick Start:

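import { encode } from '@byjohann/toon'

// A nested object with a primitive array (tags) and an empty array (preferences)
const data = {
  user: {
    id: 123,
    name: 'Ada',
    tags: ['reading', 'gaming'],
    active: true,
    preferences: []
  }
}

// Print the TOON-encoded string
console.log(encode(data))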

This example encodes a nested object containing a primitive array (tags) and an empty array (preferences), producing a TOON string.

Highlighted Details

Token Efficiency: Benchmarks report a 64.7% reduction in token count for a large dataset example.
LLM-Friendly Guardrails: Explicitly includes array lengths and field lists, aiding LLMs in validating generated output.
Minimal Syntax: Removes redundant characters, using indentation and whitespace for structure, improving readability and reducing token count.
Tabular Arrays: Efficiently encodes uniform arrays of objects by declaring keys once and streaming values, drastically reducing repetition.
Custom Delimiters: Supports comma (default), tab (\t), and pipe (|) delimiters for arrays, offering flexibility and potential further token savings.
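
To make the guardrail idea concrete, the sketch below checks a tabular block against its own header: the declared array length must match the number of rows, and each row must have one value per declared field. checkTabularBlock is a hypothetical helper, not part of the library's API. It assumes a comma-delimited block with a name[N]{fields}: header and does not handle quoted values or the tab and pipe delimiter variants.

```typescript
// Hypothetical validator for illustration; not part of the toon library.
interface TabularCheck {
  ok: boolean;
  problems: string[];
}

// Compare a block's declared length and field list with its actual rows.
// Assumed header shape: name[N]{field1,field2,...}:
function checkTabularBlock(block: string): TabularCheck {
  const [header = "", ...rest] = block.split("\n");
  const match = /^(\w+)\[(\d+)\]\{([^}]*)\}:$/.exec(header.trim());
  if (!match) {
    return { ok: false, problems: ["header does not match name[N]{fields}:"] };
  }
  const declared = Number(match[2]);
  const fields = (match[3] ?? "").split(",");
  const rows = rest.filter((line) => line.trim() !== "");
  const problems: string[] = [];

  if (rows.length !== declared) {
    problems.push(`declared ${declared} rows, found ${rows.length}`);
  }
  rows.forEach((line, i) => {
    // Naive split: values containing commas would need real parsing.
    const cells = line.trim().split(",");
    if (cells.length !== fields.length) {
      problems.push(
        `row ${i + 1}: expected ${fields.length} values, found ${cells.length}`
      );
    }
  });
  return { ok: problems.length === 0, problems };
}

const good = checkTabularBlock("items[2]{sku,qty}:\n  A1,2\n  B2,1");
const truncated = checkTabularBlock("items[3]{sku,qty}:\n  A1,2\n  B2");
console.log(good.ok);            // true
console.log(truncated.problems); // two problems: row count and row 2 width
```

A check like this can flag truncated or malformed model output before it is parsed further.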

Maintenance & Community

The provided README does not contain specific information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

License: MIT License.
Compatibility: The MIT license is highly permissive, making TOON suitable for use in commercial and closed-source applications. It is designed as a format for LLM prompts rather than a direct replacement for JSON in standard APIs or data storage.

Limitations & Caveats

Token savings depend on the specific LLM tokenizer used; the benchmarks are based on GPT-style tokenizers. The efficient tabular array format requires every object in an array to have an identical key set and only primitive values; otherwise TOON falls back to a more verbose list format. TOON is optimized for LLM contexts and is not a drop-in replacement for JSON in general-purpose programming scenarios.
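
The tabular eligibility rule can be checked before encoding. The sketch below is a hypothetical pre-check under the conditions stated above (identical key sets, primitive values only), not the library's own logic. isTabularEligible and its helpers are illustrative names, and treating an empty array as ineligible is a choice made for this sketch.

```typescript
// Hypothetical pre-check for illustration; not the toon library's own logic.

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isPrimitive(value: unknown): boolean {
  const t = typeof value;
  return value === null || t === "string" || t === "number" || t === "boolean";
}

// Order-independent fingerprint of an object's key set.
function keySignature(obj: Record<string, unknown>): string {
  return Object.keys(obj).sort().join("\u0000");
}

function hasOnlyPrimitives(obj: Record<string, unknown>): boolean {
  return Object.keys(obj).every((k) => isPrimitive(obj[k]));
}

// True when every element is an object with the same keys and only
// primitive values, i.e. the conditions described for the tabular form.
function isTabularEligible(rows: unknown[]): boolean {
  const first = rows[0];
  if (!isPlainObject(first)) return false; // also rejects empty arrays
  const signature = keySignature(first);
  return rows.every(
    (row) =>
      isPlainObject(row) &&
      keySignature(row) === signature &&
      hasOnlyPrimitives(row)
  );
}

const uniform = [{ id: 1, name: "Ada" }, { id: 2, name: "Bob" }];
const mixedKeys = [{ id: 1, name: "Ada" }, { id: 2 }];
const nested = [{ id: 1, tags: ["a"] }, { id: 2, tags: ["b"] }];

console.log(isTabularEligible(uniform));   // true
console.log(isTabularEligible(mixedKeys)); // false
console.log(isTabularEligible(nested));    // false
```

Running a check like this over a payload gives a rough idea of how much of it would benefit from the compact form and how much would fall back to the list format.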

Health Check

Last Commit

4 weeks ago

Responsiveness

Inactive

Star History

341 stars in the last 30 days