benbrandt/text-splitter: Rust crate for splitting text into semantic chunks
Top 59.0% on SourcePulse
This library provides robust text splitting capabilities for Large Language Models (LLMs) by dividing large documents into smaller, semantically meaningful chunks. It targets developers working with LLMs who need to manage context window limitations, offering flexible chunking strategies based on characters, tokens, or structured document formats like Markdown and code.
How It Works
The library employs a multi-level semantic splitting approach. It prioritizes larger semantic units (like sentences, paragraphs, or code blocks) that fit within the desired chunk size, falling back to smaller units (words, graphemes, characters) if necessary. This strategy aims to preserve context and coherence within each chunk, improving the effectiveness of LLM processing. It supports custom chunk sizing via Hugging Face tokenizers and tiktoken-rs for precise token-based splitting.
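A minimal sketch of character-based splitting, assuming the crate's current API in which TextSplitter::new accepts a character capacity and chunks returns an iterator of string slices; the input text is a placeholder.

```rust
use text_splitter::TextSplitter;

fn main() {
    // Placeholder document; any &str works.
    let text = "First paragraph.\n\nSecond paragraph with a few sentences. Another sentence here.";

    // Cap chunks at 100 characters. The splitter keeps the largest semantic
    // unit that fits (paragraphs, then sentences, words, graphemes, characters).
    let splitter = TextSplitter::new(100);

    for chunk in splitter.chunks(text) {
        println!("{chunk:?}");
    }
}
```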
Quick Start & Requirements
- Basic usage: cargo add text-splitter
- Token-based chunk sizing: cargo add text-splitter --features tokenizers or cargo add text-splitter --features tiktoken-rs (a token-based configuration is sketched below)
- Markdown splitting: cargo add text-splitter --features markdown
- Code splitting: cargo add text-splitter --features code and cargo add tree-sitter-<language>
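As one example of the feature flags above, a hedged sketch of token-based sizing with tiktoken-rs, assuming ChunkConfig::new(...).with_sizer(...) is available in the installed version and that the cl100k_base BPE implements the crate's chunk-sizer trait:

```rust
// Needs: cargo add text-splitter --features tiktoken-rs
use text_splitter::{ChunkConfig, TextSplitter};
use tiktoken_rs::cl100k_base;

fn main() {
    // cl100k_base is the BPE used by several OpenAI models.
    let tokenizer = cl100k_base().expect("failed to load tokenizer");

    // Cap chunks at 512 tokens rather than characters.
    let splitter = TextSplitter::new(ChunkConfig::new(512).with_sizer(tokenizer));

    let chunks: Vec<&str> = splitter.chunks("your document text").collect();
    println!("{} chunk(s)", chunks.len());
}
```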
Highlighted Details
- Chunk sizing by characters, tokens (via tiktoken-rs or tokenizers), or semantic structure.
- MarkdownSplitter and CodeSplitter (requires tree-sitter) for structured content.
- Chunk capacity can be specified as a range (min..max); see the sketch below.
- icu_segmenter for Unicode-compliant word and sentence boundary detection.
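A sketch of the MarkdownSplitter with a min..max chunk capacity, assuming the markdown feature is enabled and that a range converts into the chunk capacity as the note above suggests:

```rust
// Needs: cargo add text-splitter --features markdown
use text_splitter::MarkdownSplitter;

fn main() {
    let markdown = "# Title\n\nIntro text.\n\n## Section\n\nMore text under the section.";

    // With a range, the splitter packs sections up toward the 1000-character max
    // but avoids emitting chunks smaller than 200 characters when it can.
    let splitter = MarkdownSplitter::new(200..1000);

    for chunk in splitter.chunks(markdown) {
        println!("{chunk:?}");
    }
}
```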
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The sentence splitting relies on Unicode boundary rules, which may not always align with linguistic sentence definitions, potentially impacting semantic accuracy in complex cases. The CodeSplitter requires specific tree-sitter parsers for each language.
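To illustrate the per-language parser requirement, a sketch using the Rust grammar; tree_sitter_rust::LANGUAGE and the CodeSplitter::new signature shown here are assumptions that depend on the crate versions in use:

```rust
// Needs: cargo add text-splitter --features code
//        cargo add tree-sitter-rust
use text_splitter::CodeSplitter;

fn main() {
    let source = "fn main() {\n    println!(\"hello\");\n}\n";

    // Each language needs its own tree-sitter grammar crate; construction
    // fails if the grammar is incompatible with the parser version.
    let splitter = CodeSplitter::new(tree_sitter_rust::LANGUAGE, 200)
        .expect("invalid tree-sitter language");

    for chunk in splitter.chunks(source) {
        println!("{chunk:?}");
    }
}
```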