text-splitter by benbrandt

Rust crate for splitting text into semantic chunks

Created 2 years ago

538 stars

Top 59.0% on SourcePulse

3 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Philipp Schmid

DevRel at Google DeepMind

Malte Pietsch

Cofounder of deepset

Project Summary

text-splitter is a Rust crate that divides large documents into smaller, semantically meaningful chunks for use with Large Language Models (LLMs). It targets developers who need to fit text into a limited context window. Chunks can be sized by characters or tokens, and the crate can split along the structure of Markdown documents and source code.

How It Works

The library splits text at several semantic levels. For each chunk it prefers the largest semantic unit (such as a paragraph, sentence, or code block) that fits within the desired chunk size, and falls back to smaller units (words, graphemes, characters) only when a larger unit would exceed the limit. This keeps related text together within a chunk, which helps an LLM process each chunk with its context intact. Chunk size can also be measured in tokens rather than characters, using Hugging Face tokenizers or tiktoken-rs.
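
The fallback order described above can be illustrated with a small standard-library sketch. This is not the crate's implementation: the `Level` enum, the `split_semantic` function, and the separator rules (blank line for paragraphs, ". " for sentences) are assumptions made for illustration, and it counts `char`s rather than graphemes or tokens.

```rust
/// Semantic levels, ordered from the largest unit to the smallest.
/// (Hypothetical names; the real crate uses its own, richer levels.)
#[derive(Clone, Copy)]
enum Level {
    Paragraph,
    Sentence,
    Word,
    Char,
}

const LEVELS: [Level; 4] = [Level::Paragraph, Level::Sentence, Level::Word, Level::Char];

/// Cut `text` into pieces at one level, keeping separators attached
/// so that the pieces concatenate back to the original text.
fn pieces(text: &str, level: Level) -> Vec<&str> {
    match level {
        Level::Paragraph => text.split_inclusive("\n\n").collect(),
        Level::Sentence => text.split_inclusive(". ").collect(),
        Level::Word => text.split_inclusive(char::is_whitespace).collect(),
        Level::Char => text
            .char_indices()
            .map(|(i, c)| &text[i..i + c.len_utf8()])
            .collect(),
    }
}

/// Emit the accumulated chunk (trimmed) if it contains any text.
fn flush(current: &mut String, out: &mut Vec<String>) {
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        out.push(trimmed.to_string());
    }
    current.clear();
}

/// Greedily merge neighbouring pieces at this level; a piece that is
/// too large on its own is split again at the next smaller level.
fn split_level(text: &str, max_chars: usize, level: usize, out: &mut Vec<String>) {
    let mut current = String::new();
    for piece in pieces(text, LEVELS[level]) {
        let len = piece.chars().count();
        if len > max_chars {
            // Single characters never exceed max_chars >= 1, so the
            // recursion always stops at or before the Char level.
            flush(&mut current, out);
            split_level(piece, max_chars, level + 1, out);
        } else if current.chars().count() + len > max_chars {
            flush(&mut current, out);
            current.push_str(piece);
        } else {
            current.push_str(piece);
        }
    }
    flush(&mut current, out);
}

/// Split `text` into chunks of at most `max_chars` characters.
pub fn split_semantic(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be at least 1");
    let mut out = Vec::new();
    split_level(text, max_chars, 0, &mut out);
    out
}
```

With a generous limit, `split_semantic` returns whole paragraphs; as the limit shrinks, the same input is cut at sentence, then word, then character boundaries. Whitespace is counted before trimming, so chunks may come out slightly shorter than the limit allows.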

Quick Start & Requirements

Install via Cargo: cargo add text-splitter
For tokenizers: cargo add text-splitter --features tokenizers or cargo add text-splitter --features tiktoken-rs
For Markdown: cargo add text-splitter --features markdown
For Code: cargo add text-splitter --features code, plus the tree-sitter grammar crate for each language you split: cargo add tree-sitter-<language>
Official Docs: https://docs.rs/text-splitter/latest/text_splitter/

Highlighted Details

Measures chunk size by character count or token count (via tiktoken-rs or tokenizers) and places chunk boundaries along semantic structure.
Offers specialized MarkdownSplitter and CodeSplitter (requires tree-sitter) for structured content.
Chunk size can be a fixed value or a range (min..max).
Leverages icu_segmenter for Unicode-compliant word and sentence boundary detection.
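
To make the "fixed value or range" option concrete, here is a hedged standard-library sketch of how a size range might steer chunking. The `Capacity` type, its `fits` method, and `pack_words` are hypothetical names invented for this example, not the crate's API. The sketch assumes a chunk is finished as soon as its length lands inside the range, and that a fixed value is simply a range whose lower and upper bounds coincide.

```rust
use std::cmp::Ordering;
use std::ops::Range;

/// Hypothetical capacity: a chunk is acceptable once it has at least
/// `desired` characters, and it must never exceed `max` characters.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Capacity {
    desired: usize,
    max: usize,
}

impl From<usize> for Capacity {
    /// A fixed size: fill the chunk as close to `n` as possible.
    fn from(n: usize) -> Self {
        Capacity { desired: n, max: n }
    }
}

impl From<Range<usize>> for Capacity {
    /// `min..max` with an exclusive upper bound, like any Rust range.
    fn from(r: Range<usize>) -> Self {
        Capacity {
            desired: r.start,
            max: r.end.saturating_sub(1).max(r.start),
        }
    }
}

impl Capacity {
    /// Less: still below the range. Equal: inside it. Greater: too big.
    fn fits(&self, size: usize) -> Ordering {
        if size < self.desired {
            Ordering::Less
        } else if size > self.max {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    }
}

/// Pack whitespace-separated words into chunks guided by `capacity`.
/// A word longer than `max` is emitted on its own; this sketch does
/// not split inside words.
fn pack_words(text: &str, capacity: impl Into<Capacity>) -> Vec<String> {
    let capacity = capacity.into();
    let mut chunks = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let joiner = if current.is_empty() { 0 } else { 1 };
        let candidate = current.chars().count() + joiner + word.chars().count();
        // Adding the word would overshoot: close the current chunk first.
        if capacity.fits(candidate) == Ordering::Greater && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
        // The chunk has reached the acceptable range: emit it right away.
        if capacity.fits(current.chars().count()) == Ordering::Equal {
            chunks.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Under these assumptions a fixed size packs each chunk as full as it can, while a range produces chunks that stop early once they reach the lower bound. The last chunk can still fall below the lower bound when the text runs out.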

Maintenance & Community

Developed by benbrandt.
Inspired by LangChain's TextSplitter, aiming for improved performance and semantic chunking.

Licensing & Compatibility

MIT License.
Permissive license suitable for commercial and closed-source applications.

Limitations & Caveats

Sentence splitting relies on Unicode boundary rules, which do not always match linguistic sentence boundaries. An abbreviation followed by a capitalized word is a typical problem case, so a chunk may occasionally end mid-sentence. The CodeSplitter requires a separate tree-sitter parser for each language it splits.
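
The sentence caveat is easiest to see with a toy rule-based splitter. The sketch below is neither icu_segmenter nor a full implementation of the Unicode segmentation rules. The hypothetical `naive_sentences` function applies one simplified rule of the same flavour (break after a terminator and whitespace when the next letter is uppercase) to show why purely rule-based boundaries can disagree with a reader's idea of a sentence.

```rust
/// Split after '.', '!' or '?' plus trailing whitespace, but only when
/// the next character is uppercase. Deliberately simplistic: it uses
/// no abbreviation dictionary, so "Dr. Smith" is treated as a break.
fn naive_sentences(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        // Consume the whitespace that follows the terminator.
        let mut end = i + c.len_utf8();
        let mut saw_space = false;
        while let Some(&(j, ws)) = chars.peek() {
            if !ws.is_whitespace() {
                break;
            }
            saw_space = true;
            end = j + ws.len_utf8();
            chars.next();
        }
        // Break only before an uppercase letter.
        let next_upper = chars.peek().map_or(false, |&(_, n)| n.is_uppercase());
        if saw_space && next_upper {
            out.push(&text[start..end]);
            start = end;
        }
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}
```

Given "Dr. Smith arrived. He sat down.", this rule yields three pieces, with "Dr. " split off as if it were a sentence of its own. A lowercase continuation such as "approx. ten dollars" stays intact.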

Health Check

Last Commit

2 days ago

Responsiveness

1 day

Star History

10 stars in the last 30 days