Rust crate for splitting text into semantic chunks
This library provides robust text splitting capabilities for Large Language Models (LLMs) by dividing large documents into smaller, semantically meaningful chunks. It targets developers working with LLMs who need to manage context window limitations, offering flexible chunking strategies based on characters, tokens, or structured document formats like Markdown and code.
How It Works
The library employs a multi-level semantic splitting approach. It prioritizes larger semantic units (like sentences, paragraphs, or code blocks) that fit within the desired chunk size, falling back to smaller units (words, graphemes, characters) if necessary. This strategy aims to preserve context and coherence within each chunk, improving the effectiveness of LLM processing. It supports custom chunk sizing via Hugging Face tokenizers and tiktoken-rs for precise token-based splitting.
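The multi-level fallback described above can be sketched in plain Rust. This is an illustration of the idea only, not the crate's actual implementation: try the largest unit (here, paragraphs) that fits the chunk budget, then fall back to words, then to raw characters.

```rust
/// Split `text` into chunks of at most `max` characters, preferring
/// paragraph boundaries, then word boundaries, then raw characters.
/// (Illustrative sketch; the real crate uses Unicode segmentation
/// and more levels.)
fn chunk(text: &str, max: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    for para in text.split("\n\n") {
        // Level 1: the whole paragraph fits.
        if para.chars().count() <= max {
            if !para.is_empty() {
                chunks.push(para.to_string());
            }
            continue;
        }
        // Level 2: pack words into chunks.
        let mut current = String::new();
        for word in para.split_whitespace() {
            if !current.is_empty()
                && current.chars().count() + 1 + word.chars().count() > max
            {
                chunks.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push(' ');
            }
            if word.chars().count() > max {
                // Level 3: a single oversized word falls back to characters.
                for ch in word.chars() {
                    if current.chars().count() >= max {
                        chunks.push(std::mem::take(&mut current));
                    }
                    current.push(ch);
                }
            } else {
                current.push_str(word);
            }
        }
        if !current.is_empty() {
            chunks.push(current);
        }
    }
    chunks
}

fn main() {
    let text = "Short paragraph.\n\nA much longer paragraph that will not fit in one chunk and must fall back to word-level splitting.";
    for c in chunk(text, 40) {
        println!("[{}] {}", c.chars().count(), c);
    }
}
```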
Quick Start & Requirements
Base install: cargo add text-splitter
Token-based sizing: cargo add text-splitter --features tokenizers or cargo add text-splitter --features tiktoken-rs
Markdown splitting: cargo add text-splitter --features markdown
Code splitting: cargo add text-splitter --features code plus cargo add tree-sitter-<language> for each language you need
Highlighted Details
Chunk sizing by characters, tokens (via tiktoken-rs or tokenizers), or semantic structure.
MarkdownSplitter and CodeSplitter (requires tree-sitter) for structured content.
Chunk sizes can be given as a range (min..max).
Uses icu_segmenter for Unicode-compliant word and sentence boundary detection.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The sentence splitting relies on Unicode boundary rules, which may not always align with linguistic sentence definitions, potentially impacting semantic accuracy in complex cases. The CodeSplitter requires a specific tree-sitter parser for each language.