semchunk by isaacus-dev

Python library for splitting text into semantically meaningful chunks

Created 2 years ago · 523 stars · Top 60.2% on SourcePulse

Project Summary

This library provides a fast, lightweight, and easy-to-use Python tool for splitting text into semantically meaningful chunks. It is designed for developers and researchers working with large text datasets, particularly in NLP applications, offering improved semantic coherence and performance over traditional chunking methods.

How It Works

semchunk employs a recursive splitting algorithm that prioritizes semantically meaningful delimiters. It uses a tiered hierarchy, preferring the largest sequences of newlines first, then tabs, other whitespace, sentence terminators, clause separators, sentence interrupters, word joiners, and finally all other characters. Oversized splits are recursively subdivided until every chunk fits within the specified token limit, and adjacent chunks are then merged back together wherever the combined result still fits. The algorithm also reattaches splitters to their chunks and excludes whitespace-only chunks to preserve semantic integrity.
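As a rough illustration of that approach, here is a simplified sketch (not semchunk's actual implementation); the delimiter tiers and the whitespace-based token counter are assumptions made for brevity:

```python
# Simplified sketch of tiered recursive splitting; not semchunk's actual code.
# The delimiter tiers and whitespace token counter are illustrative assumptions.

DELIMITER_TIERS = ["\n\n", "\n", "\t", " ", ". ", "; ", ", "]

def count_tokens(text: str) -> int:
    # Stand-in token counter; semchunk accepts real tokenizers instead.
    return len(text.split())

def split_recursively(text: str, chunk_size: int, tier: int = 0) -> list[str]:
    # Base case: the text already fits, or no delimiter tiers are left to try.
    if count_tokens(text) <= chunk_size or tier >= len(DELIMITER_TIERS):
        return [text]
    delimiter = DELIMITER_TIERS[tier]
    parts = text.split(delimiter)
    if len(parts) == 1:
        # Delimiter absent at this tier; fall through to the next one.
        return split_recursively(text, chunk_size, tier + 1)
    # Split on the current delimiter, recursing into oversized parts.
    pieces: list[str] = []
    for part in parts:
        pieces.extend(split_recursively(part, chunk_size, tier + 1))
    # Merge adjacent pieces back together while they stay under the limit,
    # reattaching the delimiter so no characters are lost.
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1] + delimiter + piece) <= chunk_size:
            merged[-1] = merged[-1] + delimiter + piece
        else:
            merged.append(piece)
    return [p for p in merged if p.strip()]  # Exclude whitespace-only chunks.
```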

Quick Start & Requirements

  • Install via pip: pip install semchunk or conda install -c conda-forge semchunk.
  • Supports OpenAI's tiktoken and Hugging Face transformers tokenizers, or custom tokenizers/counters.
  • Example usage and detailed API documentation are available in the README; a minimal usage sketch follows this list.
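A minimal usage sketch based on the README's documented chunkerify interface; the model name and chunk size are arbitrary choices for illustration:

```python
import semchunk

# Build a chunker from a tokenizer and a maximum chunk size in tokens.
# 'gpt-4' resolves to the matching tiktoken encoding; Hugging Face tokenizer
# objects and plain token-counting callables are also accepted.
chunker = semchunk.chunkerify("gpt-4", chunk_size=4)

text = "The quick brown fox jumps over the lazy dog."
print(chunker(text))          # One text in -> a list of chunk strings out.
print(chunker([text, text]))  # Many texts in -> a list of chunk lists out.
```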

Highlighted Details

  • 85% faster than semantic-text-splitter in benchmarks.
  • Supports overlapping chunks and returning character offsets.
  • Built-in multiprocessing support for faster chunking of multiple texts (a sketch of these options follows this list).
  • Can exclude chunks consisting entirely of whitespace.
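A sketch of how these options are passed at call time; the overlap, offsets, and processes parameter names reflect the chunker interface documented in the README, but may vary by version, so check the API docs for your installed release:

```python
import semchunk

# A plain word-count token counter keeps this sketch dependency-free.
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size=4)

text = "The quick brown fox jumps over the lazy dog."

# Overlapping chunks: a float below 1 overlaps by that fraction of chunk_size.
print(chunker(text, overlap=0.5))

# Character offsets: returns (chunks, offsets), with a (start, end) pair per chunk.
chunks, offsets = chunker(text, offsets=True)
print(offsets)

# Multiprocessing: fan multiple texts out across worker processes.
print(chunker([text] * 10, processes=2))
```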

Maintenance & Community

  • Actively maintained by Isaacus, which uses it in the production API for its legal AI models.
  • A Rust port, semchunk-rs, is maintained by @dominictarro.

Licensing & Compatibility

  • Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

  • When specifying chunk_size, users should account for any special tokens added by their chosen tokenizer, as the library does not automatically deduct them; a workaround sketch follows.
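A hedged workaround sketch, assuming a Hugging Face tokenizer (the model name is illustrative) and that your installed version does not deduct special tokens automatically: measure how many special tokens the tokenizer adds per encoding and subtract that from the desired chunk size.

```python
import semchunk
from transformers import AutoTokenizer

# 'bert-base-uncased' is illustrative; any Hugging Face tokenizer works similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Count the special tokens (e.g. [CLS], [SEP]) the tokenizer adds per encoding.
num_special = tokenizer.num_special_tokens_to_add()

# Deduct them from the desired size so each chunk still fits once encoded.
desired_chunk_size = 512
chunker = semchunk.chunkerify(tokenizer, chunk_size=desired_chunk_size - num_special)
```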
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 15 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 1 more.

text-splitter by benbrandt
538 stars · Created 2 years ago · Updated 2 days ago
Rust crate for splitting text into semantic chunks

Starred by Luis Capelo (Cofounder of Lightning AI), Carol Willing (Core Contributor to CPython and Jupyter), and 2 more.

chonkie by chonkie-inc
4k stars · Created 9 months ago · Updated 2 days ago
Chunking library for RAG applications