semchunk by isaacus-dev

Python library for splitting text into semantically meaningful chunks

Created 1 year ago · 348 stars · Top 80.9% on sourcepulse

Project Summary

semchunk is a fast, lightweight, and easy-to-use Python library for splitting text into semantically meaningful chunks. It is aimed at developers and researchers working with large text datasets, particularly in NLP applications, and offers better semantic coherence and performance than naive fixed-size chunking.

How It Works

semchunk uses a recursive splitting algorithm that prioritizes semantically meaningful delimiters. Splitting proceeds in tiers: the largest sequences of newlines first, then tabs, other whitespace, sentence terminators, clause separators, sentence interrupters, word joiners, and finally all remaining characters. Text is split recursively until every chunk fits within the specified token size, and adjacent chunks are then merged back together wherever the result still fits. The algorithm also reattaches the delimiters it split on and can exclude whitespace-only chunks, preserving semantic integrity.
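
In sketch form, the tiered recursion looks something like the following. This is a simplified illustration, not semchunk's actual source: the delimiter tiers are abbreviated, a whitespace word count stands in for a real token counter, and the reattachment of delimiters is omitted for brevity.

    # Simplified sketch of tiered recursive splitting (illustrative only).
    DELIMITERS = ["\n\n", "\n", "\t", ". ", ", ", " "]  # coarse-to-fine tiers

    def token_count(text: str) -> int:
        # Stand-in token counter: whitespace-delimited words.
        return len(text.split())

    def split_recursively(text: str, chunk_size: int, tier: int = 0) -> list[str]:
        if token_count(text) <= chunk_size or tier >= len(DELIMITERS):
            return [text]
        parts = [p for p in text.split(DELIMITERS[tier]) if p.strip()]
        if len(parts) <= 1:
            # This tier's delimiter is absent; fall through to a finer tier.
            return split_recursively(text, chunk_size, tier + 1)
        chunks: list[str] = []
        for part in parts:
            chunks.extend(split_recursively(part, chunk_size, tier))
        return merge(chunks, chunk_size)

    def merge(chunks: list[str], chunk_size: int) -> list[str]:
        # Greedily re-merge adjacent chunks while staying under the limit.
        merged = [chunks[0]]
        for chunk in chunks[1:]:
            candidate = merged[-1] + " " + chunk
            if token_count(candidate) <= chunk_size:
                merged[-1] = candidate
            else:
                merged.append(chunk)
        return merged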

Quick Start & Requirements

  • Install via pip (pip install semchunk) or conda (conda install -c conda-forge semchunk).
  • Works with OpenAI's tiktoken and Hugging Face transformers tokenizers, as well as custom tokenizers and token counters.
  • Example usage and detailed API documentation are available in the README; a minimal sketch follows this list.
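
A minimal usage sketch built on the chunkerify entry point shown in the README (the word-counting token counter and chunk size here are purely illustrative):

    import semchunk

    # chunkerify accepts a tokenizer name, a tiktoken or Hugging Face
    # tokenizer object, or a custom token counter; a word counter is
    # used here so the example has no external dependencies.
    chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size=4)

    chunks = chunker("The quick brown fox jumps over the lazy dog.")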

Highlighted Details

  • 85% faster than semantic-text-splitter in benchmarks.
  • Supports overlapping chunks and returning character offsets (see the sketch after this list).
  • Built-in support for multiprocessing for faster chunking of multiple texts.
  • Can exclude chunks consisting entirely of whitespace.
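
A hedged sketch of these features; the keyword names (overlap, offsets, processes) are assumptions drawn from the README and may differ between versions:

    import semchunk

    chunker = semchunk.chunkerify(lambda t: len(t.split()), chunk_size=8)
    text = "Long document text to be chunked into small pieces. " * 20

    # Overlapping chunks: overlap is interpreted as a proportion of
    # chunk_size when below 1 (assumption based on the README).
    overlapping = chunker(text, overlap=0.5)

    # Return character offsets of each chunk into the source text.
    chunks, offsets = chunker(text, offsets=True)

    # Chunk many texts in parallel; on platforms that spawn processes,
    # this call belongs under an `if __name__ == "__main__":` guard.
    batched = chunker([text, text], processes=2)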

Maintenance & Community

  • Actively maintained by Isaacus and used in its production API for legal AI models.
  • A Rust port, semchunk-rs, is maintained by @dominictarro.

Licensing & Compatibility

  • Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

  • When specifying chunk_size, users should account for any special tokens their chosen tokenizer adds (e.g. [CLS]/[SEP] or BOS/EOS), as the library does not deduct these automatically; see the sketch below.
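
One way to account for that overhead is to measure how many tokens the tokenizer adds to empty input and subtract them from the model limit. A sketch assuming a Hugging Face tokenizer (the model name is illustrative):

    import semchunk
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Special tokens added even to empty input ([CLS] and [SEP] for BERT).
    special_token_overhead = len(tokenizer("").input_ids)

    model_limit = 512
    chunker = semchunk.chunkerify(
        tokenizer, chunk_size=model_limit - special_token_overhead
    )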

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star history: 51 stars in the last 90 days
