semchunk by isaacus-dev

Python library for splitting text into semantically meaningful chunks

Created 1 year ago · 348 stars · Top 80.9% on sourcepulse

Project Summary

semchunk is a fast, lightweight, and easy-to-use Python library for splitting text into semantically meaningful chunks. It is aimed at developers and researchers working with large text datasets, particularly in NLP applications, and offers better semantic coherence and performance than naive fixed-size chunking.

How It Works

semchunk uses a recursive splitting algorithm that prioritizes semantically meaningful delimiters. Splitting proceeds in tiers: the largest sequences of newlines first, then tabs, other whitespace, sentence terminators, clause separators, sentence interrupters, word joiners, and finally all remaining characters. Text is split recursively until every chunk fits within the specified token size, and adjacent chunks are then merged back together wherever the result still fits. The algorithm also reattaches the delimiters it split on and can exclude whitespace-only chunks, preserving semantic integrity.
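
In sketch form, the tiered recursion looks something like the following. This is a simplified illustration, not semchunk's actual source: the delimiter tiers are abbreviated, a whitespace word count stands in for a real token counter, and the reattachment of delimiters is omitted for brevity.

    # Simplified sketch of tiered recursive splitting (illustrative only).
    DELIMITERS = ["\n\n", "\n", "\t", ". ", ", ", " "]  # coarse-to-fine tiers

    def token_count(text: str) -> int:
        # Stand-in token counter: whitespace-delimited words.
        return len(text.split())

    def split_recursively(text: str, chunk_size: int, tier: int = 0) -> list[str]:
        if token_count(text) <= chunk_size or tier >= len(DELIMITERS):
            return [text]
        parts = [p for p in text.split(DELIMITERS[tier]) if p.strip()]
        if len(parts) <= 1:
            # This tier's delimiter is absent; fall through to a finer tier.
            return split_recursively(text, chunk_size, tier + 1)
        chunks: list[str] = []
        for part in parts:
            chunks.extend(split_recursively(part, chunk_size, tier))
        return merge(chunks, chunk_size)

    def merge(chunks: list[str], chunk_size: int) -> list[str]:
        # Greedily re-merge adjacent chunks while staying under the limit.
        merged = [chunks[0]]
        for chunk in chunks[1:]:
            candidate = merged[-1] + " " + chunk
            if token_count(candidate) <= chunk_size:
                merged[-1] = candidate
            else:
                merged.append(chunk)
        return merged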

Quick Start & Requirements

  • Install via pip (pip install semchunk) or conda (conda install -c conda-forge semchunk).
  • Works with OpenAI's tiktoken and Hugging Face transformers tokenizers, as well as custom tokenizers and token counters.
  • Example usage and detailed API documentation are available in the README; a minimal sketch follows this list.
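
A minimal usage sketch built on the chunkerify entry point shown in the README (the word-counting token counter and chunk size here are purely illustrative):

    import semchunk

    # chunkerify accepts a tokenizer name, a tiktoken or Hugging Face
    # tokenizer object, or a custom token counter; a word counter is
    # used here so the example has no external dependencies.
    chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size=4)

    chunks = chunker("The quick brown fox jumps over the lazy dog.")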

Highlighted Details

  • 85% faster than semantic-text-splitter in benchmarks.
  • Supports overlapping chunks and returning character offsets (see the sketch after this list).
  • Built-in support for multiprocessing for faster chunking of multiple texts.
  • Can exclude chunks consisting entirely of whitespace.
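
A hedged sketch of these features; the keyword names (overlap, offsets, processes) are assumptions drawn from the README and may differ between versions:

    import semchunk

    chunker = semchunk.chunkerify(lambda t: len(t.split()), chunk_size=8)
    text = "Long document text to be chunked into small pieces. " * 20

    # Overlapping chunks: overlap is interpreted as a proportion of
    # chunk_size when below 1 (assumption based on the README).
    overlapping = chunker(text, overlap=0.5)

    # Return character offsets of each chunk into the source text.
    chunks, offsets = chunker(text, offsets=True)

    # Chunk many texts in parallel; on platforms that spawn processes,
    # this call belongs under an `if __name__ == "__main__":` guard.
    batched = chunker([text, text], processes=2)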

Maintenance & Community

  • Actively maintained by Isaacus and used in its production API for legal AI models.
  • A Rust port, semchunk-rs, is maintained by @dominictarro.

Licensing & Compatibility

  • Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

  • When specifying chunk_size, users should account for any special tokens their chosen tokenizer adds (e.g. [CLS]/[SEP] or BOS/EOS), as the library does not deduct these automatically; see the sketch below.
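
One way to account for that overhead is to measure how many tokens the tokenizer adds to empty input and subtract them from the model limit. A sketch assuming a Hugging Face tokenizer (the model name is illustrative):

    import semchunk
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Special tokens added even to empty input ([CLS] and [SEP] for BERT).
    special_token_overhead = len(tokenizer("").input_ids)

    model_limit = 512
    chunker = semchunk.chunkerify(
        tokenizer, chunk_size=model_limit - special_token_overhead
    )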

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star history: 51 stars in the last 90 days
