Python library for splitting text into semantically meaningful chunks
This library provides a fast, lightweight, and easy-to-use Python tool for splitting text into semantically meaningful chunks. It is designed for developers and researchers working with large text datasets, particularly in NLP applications, offering improved semantic coherence and performance over traditional chunking methods.
How It Works
semchunk employs a recursive splitting algorithm that prioritizes semantically meaningful delimiters. It uses a tiered approach, starting with the largest sequences of newlines, then tabs, whitespace, sentence terminators, clause separators, sentence interrupters, word joiners, and finally all other characters. Chunks are recursively split until they meet the specified token size, then merged if under the limit. The algorithm also handles reattaching splitters and excluding whitespace-only chunks for better semantic integrity.
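The tiered recursion described above can be sketched in plain Python. This is a simplified illustration, not semchunk's actual implementation: the delimiter tiers, the space-joined merge rule, and the whitespace-word token counter are all assumptions made for the sketch.

```python
# Simplified sketch of tiered recursive splitting and merging.
# Tiers run from most to least semantically meaningful; the real
# library's tiers (and its splitter-reattachment logic) are richer.
TIERS = ["\n\n", "\n", "\t", ". ", ", ", " "]

def count_tokens(text: str) -> int:
    # Stand-in token counter: whitespace-delimited words.
    return len(text.split())

def chunk(text: str, chunk_size: int, tier: int = 0) -> list[str]:
    """Recursively split `text` until every chunk fits within `chunk_size`."""
    if count_tokens(text) <= chunk_size:
        # Exclude whitespace-only chunks for semantic integrity.
        return [text] if text.strip() else []
    if tier >= len(TIERS):
        # No meaningful delimiter left: fall back to splitting in half.
        mid = len(text) // 2
        return chunk(text[:mid], chunk_size, tier) + chunk(text[mid:], chunk_size, tier)
    out: list[str] = []
    for piece in text.split(TIERS[tier]):
        for sub in chunk(piece, chunk_size, tier + 1):
            # Merge adjacent chunks back together while under the limit.
            if out and count_tokens(out[-1] + " " + sub) <= chunk_size:
                out[-1] = out[-1] + " " + sub
            else:
                out.append(sub)
    return out
```

For example, `chunk("one two three four five six seven eight", 3)` splits down to single words and merges them back into three-word chunks.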
Quick Start & Requirements
pip install semchunk
or conda install -c conda-forge semchunk
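Beyond installation, a chunker is driven by a token counter. The snippet below defines a minimal custom counter; the commented-out `semchunk.chunkerify` call is a hypothetical usage sketch, so consult the library's documentation for the exact API.

```python
# A minimal custom token counter: any callable mapping text to an
# integer token count. Whitespace words are used here purely for
# illustration; real counters usually wrap a tokenizer.
def word_counter(text: str) -> int:
    return len(text.split())

# With semchunk installed, a chunker could be built along these lines
# (shown as a comment; signature details are an assumption):
# import semchunk
# chunker = semchunk.chunkerify(word_counter, chunk_size=512)
# chunks = chunker(long_text)

print(word_counter("splitting text into chunks"))  # → 4
```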
Supports tiktoken and Hugging Face transformers tokenizers, as well as custom tokenizers and token counters.

Highlighted Details
Outperforms semantic-text-splitter in benchmarks.

Maintenance & Community
A Rust port, semchunk-rs, is maintained by @dominictarro.

Licensing & Compatibility

Released under the MIT License.
Limitations & Caveats
When setting chunk_size, users should account for special tokens added by their chosen tokenizer, as the library does not automatically deduct these.
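One way to respect this caveat is to shrink the chunk size by the tokenizer's special-token overhead before chunking. The sketch below uses a stand-in tokenizer class with illustrative attribute names; with Hugging Face tokenizers, the `num_special_tokens_to_add()` method reports this overhead.

```python
# Stand-in for a real tokenizer that wraps every sequence in
# [CLS] ... [SEP], i.e. adds 2 special tokens. All names here are
# illustrative, not part of semchunk's API.
class FakeTokenizer:
    num_special_tokens = 2   # e.g. [CLS] and [SEP]
    model_max_length = 512

def effective_chunk_size(tokenizer) -> int:
    # Leave room for the special tokens the tokenizer will add,
    # since the library does not deduct them automatically.
    return tokenizer.model_max_length - tokenizer.num_special_tokens

print(effective_chunk_size(FakeTokenizer()))  # → 510
```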