toon  by toon-format

Compact data format for LLMs

Created 1 week ago

9,117 stars

Top 5.6% on SourcePulse

Project Summary

Token-Oriented Object Notation (TOON) is a compact, human-readable data format designed to reduce token usage when passing structured data to Large Language Models (LLMs). It targets developers and power users who frequently send large datasets to LLMs and want to lower costs and improve efficiency. TOON typically uses 30-60% fewer tokens than standard JSON.

How It Works

TOON merges YAML's indentation-based structure for nested objects with CSV's tabular format for uniform data rows, optimizing for LLM contexts. It minimizes token overhead by removing redundant punctuation like braces, brackets, and most quotes, relying instead on whitespace and explicit declarations. Tabular arrays are a key feature, allowing keys to be declared once, followed by streamed rows without repetition, further enhancing token efficiency.
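
The tabular idea described above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `encodeTabular`, `Row`, and the sample data are hypothetical, the `name[N]{fields}:` header layout is an assumption about the syntax, and the sketch assumes values contain no commas or newlines, so it performs no quoting or escaping.

```typescript
// Hypothetical helper, not part of any TOON library: a simplified sketch of
// the tabular-array idea for uniform rows of primitive values.
type Primitive = string | number | boolean | null;
type Row = Record<string, Primitive>;

function encodeTabular(name: string, rows: Row[]): string {
  if (rows.length === 0) {
    return `${name}[0]:`;
  }
  // Keys are declared once, taken from the first row.
  const keys = Object.keys(rows[0]);
  const header = `${name}[${rows.length}]{${keys.join(",")}}:`;
  // Each row carries only values, indented under the header.
  // Assumption: values contain no commas or newlines, so no quoting is done.
  const lines = rows.map(
    (row) => "  " + keys.map((k) => String(row[k])).join(",")
  );
  return [header, ...lines].join("\n");
}

const items: Row[] = [
  { sku: "A1", qty: 2, price: 9.99 },
  { sku: "B2", qty: 1, price: 14.5 },
];

const tabular = encodeTabular("items", items);
console.log(tabular);
// Compare raw character counts; token counts depend on the tokenizer.
console.log(
  `JSON: ${JSON.stringify(items).length} chars, tabular: ${tabular.length} chars`
);
```

Character counts are only a rough proxy here; actual savings would need to be measured with the tokenizer of the target model.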

Quick Start & Requirements

  • Primary install: npm install @byjohann/toon (also supports pnpm and yarn).
  • Prerequisites: Requires Node.js and a package manager (npm, pnpm, or yarn). No specific hardware, OS, or GPU requirements are mentioned.
  • Quick Start:
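    import { encode } from '@byjohann/toon'

    // Nested object with a primitive array (tags) and an empty array (preferences)
    const data = {
      user: {
        id: 123,
        name: 'Ada',
        tags: ['reading', 'gaming'],
        active: true,
        preferences: []
      }
    }

    // Print the TOON-encoded string
    console.log(encode(data))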
    
    This example demonstrates encoding a nested object with primitive arrays and empty arrays, resulting in a TOON string.

Highlighted Details

  • Token Efficiency: Benchmarks show significant savings, with a large dataset example reducing token count by 64.7%.
  • LLM-Friendly Guardrails: Explicitly includes array lengths and field lists, aiding LLMs in validating generated output.
  • Minimal Syntax: Removes redundant characters, using indentation and whitespace for structure, improving readability and reducing token count.
  • Tabular Arrays: Efficiently encodes uniform arrays of objects by declaring keys once and streaming values, drastically reducing repetition.
  • Custom Delimiters: Supports comma (default), tab (\t), and pipe (|) delimiters for arrays, offering flexibility and potential further token savings.
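
To illustrate how explicit array lengths and field lists can serve as guardrails, here is a minimal sketch of a checker for a block that a model has generated. `checkTabularBlock` is a hypothetical helper, not part of the library's API; it assumes a comma-delimited block with a `name[N]{fields}:` header and unquoted values.

```typescript
// Hypothetical validator, not the library's API. Assumes the default comma
// delimiter, a header of the form name[N]{field1,field2}:, and unquoted values.
function checkTabularBlock(block: string): string[] {
  const lines = block.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) {
    return ["empty block"];
  }
  const header = lines[0].trim();
  const rows = lines.slice(1);

  const match = /^(\w+)\[(\d+)\]\{([^}]*)\}:$/.exec(header);
  if (!match) {
    return ["header does not match name[N]{fields}:"];
  }
  const declaredLength = Number(match[2]);
  const fields = match[3].split(",");

  const problems: string[] = [];
  // Guardrail 1: the declared length must equal the number of rows.
  if (rows.length !== declaredLength) {
    problems.push(`declared ${declaredLength} rows, found ${rows.length}`);
  }
  // Guardrail 2: every row must supply one value per declared field.
  rows.forEach((row, i) => {
    const cells = row.trim().split(",");
    if (cells.length !== fields.length) {
      problems.push(
        `row ${i + 1}: expected ${fields.length} values, found ${cells.length}`
      );
    }
  });
  return problems;
}

const good = "items[2]{sku,qty}:\n  A1,2\n  B2,1";
const bad = "items[3]{sku,qty}:\n  A1,2\n  B2";

console.log(checkTabularBlock(good)); // no problems reported
console.log(checkTabularBlock(bad)); // reports the row-count and field-count mismatches
```

An empty result means both checks passed; a non-empty one can be fed back to the model as a correction prompt.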

Maintenance & Community

The provided README does not contain specific information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license is highly permissive, making TOON suitable for use in commercial and closed-source applications. It is designed as a format for LLM prompts rather than a direct replacement for JSON in standard APIs or data storage.

Limitations & Caveats

Token savings depend on the LLM tokenizer in use; the benchmarks are based on GPT-style tokenizers. The tabular array format requires every object in an array to have an identical key set and only primitive values; otherwise TOON falls back to a more verbose list format. TOON is optimized for LLM contexts and is not a drop-in replacement for JSON in general-purpose programming.
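
The tabular eligibility rule can be checked before encoding. The following is a sketch under stated assumptions rather than the library's own logic: `isTabularEligible` and the `Json` type are hypothetical, key order is treated as irrelevant, and an empty array is treated as not eligible.

```typescript
// Hypothetical pre-check, not the library's own logic.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isTabularEligible(items: Json[]): boolean {
  // Assumption: an empty array has nothing to tabulate.
  if (items.length === 0) return false;

  const isPlainObject = (v: Json): v is JsonObject =>
    typeof v === "object" && v !== null && !Array.isArray(v);
  const isPrimitive = (v: Json): boolean => v === null || typeof v !== "object";

  // Every element must be an object, not a primitive or a nested array.
  if (!items.every(isPlainObject)) return false;
  const objects = items as JsonObject[];

  // Assumption: key order does not matter, so keys are sorted before comparing.
  const signature = (obj: JsonObject): string =>
    Object.keys(obj).sort().join("\u0000");
  const reference = signature(objects[0]);

  return objects.every(
    (obj) =>
      signature(obj) === reference && Object.values(obj).every(isPrimitive)
  );
}

console.log(isTabularEligible([{ a: 1, b: "x" }, { b: "y", a: 2 }]));
console.log(isTabularEligible([{ a: 1 }, { a: 1, b: 2 }]));
console.log(isTabularEligible([{ a: [1] }, { a: [2] }]));
```

The first call has matching key sets and primitive values; the second differs in keys and the third nests an array, so both would be expected to take the more verbose list form.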

Health Check

  • Last Commit: 7 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 49
  • Issues (30d): 66
  • Star History: 9,532 stars in the last 12 days
