text-splitter by benbrandt

Rust crate for splitting text into semantic chunks

created 2 years ago
461 stars

Top 66.7% on sourcepulse

Project Summary

This library splits large documents into smaller, semantically meaningful chunks for use with Large Language Models (LLMs). It targets developers who need to fit text into a limited context window, and it can size chunks by characters or tokens and split along the structure of formats such as Markdown and source code.

How It Works

The library uses a multi-level semantic splitting approach. It prefers the largest semantic unit (such as a paragraph, sentence, or code block) that fits within the desired chunk size, and falls back to smaller units (words, graphemes, characters) only when a larger unit does not fit. This keeps related text together, so each chunk passed to an LLM retains as much surrounding context as possible. Chunk size can be measured in tokens rather than characters by using Hugging Face tokenizers or tiktoken-rs.
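The fallback strategy described above can be illustrated with a small, standard-library-only sketch. This is not the crate's implementation: the function names (`split_semantic`, `split_level`, `units`, `separator`) are hypothetical, only three levels are modeled (blank-line paragraphs, whitespace-separated words, single characters), and word-level merging collapses runs of whitespace into one space. The real crate handles sentences and graphemes using Unicode segmentation rules.

```rust
/// Splits `text` at one semantic level. Levels run from largest to smallest:
/// 0 = paragraphs (blank-line separated), 1 = words, anything else = characters.
/// These levels are simplified stand-ins chosen for this sketch.
fn units(text: &str, level: usize) -> Vec<&str> {
    match level {
        0 => text.split("\n\n").collect(),
        1 => text.split_whitespace().collect(),
        _ => text
            .char_indices()
            .map(move |(i, c)| &text[i..i + c.len_utf8()])
            .collect(),
    }
}

/// Separator used when merging adjacent units of a level back together.
fn separator(level: usize) -> &'static str {
    match level {
        0 => "\n\n",
        1 => " ",
        _ => "",
    }
}

/// Greedily merges units of the current level into chunks of at most
/// `max_chars` characters; a unit that is too large on its own is re-split
/// at the next smaller level.
fn split_level(text: &str, max_chars: usize, level: usize, out: &mut Vec<String>) {
    let sep = separator(level);
    let sep_len = sep.chars().count();
    let mut current = String::new();
    let mut current_len = 0; // length of `current` in characters

    for unit in units(text, level) {
        let unit_len = unit.chars().count();
        if unit_len == 0 {
            continue;
        }
        if unit_len > max_chars {
            // Flush what we have, then fall back to a smaller unit.
            if !current.is_empty() {
                out.push(std::mem::take(&mut current));
                current_len = 0;
            }
            split_level(unit, max_chars, level + 1, out);
            continue;
        }
        // Start a new chunk if adding this unit would exceed the limit.
        if !current.is_empty() && current_len + sep_len + unit_len > max_chars {
            out.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if !current.is_empty() {
            current.push_str(sep);
            current_len += sep_len;
        }
        current.push_str(unit);
        current_len += unit_len;
    }
    if !current.is_empty() {
        out.push(current);
    }
}

/// Entry point of the sketch: chunks `text` so that no chunk exceeds
/// `max_chars` characters.
pub fn split_semantic(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be at least 1");
    let mut out = Vec::new();
    split_level(text, max_chars, 0, &mut out);
    out
}
```

Calling `split_semantic(text, 20)` keeps a short paragraph whole, breaks a longer paragraph at word boundaries, and cuts a single 26-letter word at the character level only because no larger unit fits.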

Quick Start & Requirements

  • Install via Cargo: cargo add text-splitter
  • For tokenizers: cargo add text-splitter --features tokenizers or cargo add text-splitter --features tiktoken-rs
  • For Markdown: cargo add text-splitter --features markdown
  • For Code: cargo add text-splitter --features code and cargo add tree-sitter-<language>
  • Official Docs: https://docs.rs/text-splitter/latest/text_splitter/

Highlighted Details

  • Supports splitting by character count, token count (via tiktoken-rs or tokenizers), or semantic structure.
  • Offers specialized MarkdownSplitter and CodeSplitter (requires tree-sitter) for structured content.
  • Chunk size can be a fixed value or a range (min..max).
  • Leverages icu_segmenter for Unicode-compliant word and sentence boundary detection.
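To make the fixed-versus-range chunk size option more concrete, here is a minimal standard-library sketch of how such a capacity could be modeled. The `Capacity` type and its `check` method are hypothetical names for illustration, not the crate's API, and treating the end of `min..max` as exclusive is an assumption that follows Rust's usual range semantics.

```rust
use std::cmp::Ordering;
use std::ops::Range;

/// Hypothetical capacity type: a chunk is acceptable once its length is
/// between `min` and `max` characters, inclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Capacity {
    min: usize,
    max: usize,
}

impl From<usize> for Capacity {
    /// A fixed size is modeled as a range whose bounds coincide.
    fn from(size: usize) -> Self {
        Capacity { min: size, max: size }
    }
}

impl From<Range<usize>> for Capacity {
    /// `min..max` excludes its end (an assumption of this sketch),
    /// so the largest accepted length is `end - 1`.
    fn from(range: Range<usize>) -> Self {
        let max = range.end.saturating_sub(1).max(range.start);
        Capacity { min: range.start, max }
    }
}

impl Capacity {
    /// Less: the chunk could still grow. Equal: it is within the accepted
    /// sizes. Greater: it is too large and must be split further.
    pub fn check(&self, chunk: &str) -> Ordering {
        let len = chunk.chars().count();
        if len < self.min {
            Ordering::Less
        } else if len > self.max {
            Ordering::Greater
        } else {
            Ordering::Equal
        }
    }
}
```

With `Capacity::from(5usize..10)`, a 3-character chunk compares as `Less`, a 6-character chunk as `Equal`, and a 10-character chunk as `Greater`. A splitter built on this idea could stop growing a chunk as soon as it compares as `Equal`.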

Maintenance & Community

  • Developed by benbrandt.
  • Inspired by LangChain's TextSplitter, aiming for improved performance and semantic chunking.

Licensing & Compatibility

  • MIT License.
  • Permissive license suitable for commercial and closed-source applications.

Limitations & Caveats

The sentence splitting relies on Unicode boundary rules, which may not always align with linguistic sentence definitions, potentially impacting semantic accuracy in complex cases. The CodeSplitter requires specific tree-sitter parsers for each language.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 28
  • Issues (30d): 1
  • Star History: 48 stars in the last 90 days
