Python library for semantic text chunking, useful in RAG systems
mirth/chonky takes a fully neural approach to text chunking: fine-tuned transformer models segment text into semantically meaningful units. The library is aimed at Retrieval-Augmented Generation (RAG) systems, where it replaces traditional rule-based or fixed-size chunking.
How It Works
Chonky uses fine-tuned transformer models, BERT-style encoders such as DistilBERT and ModernBERT, to predict where a text should be split. The model identifies semantic boundaries in the text so that each resulting chunk is coherent and contextually self-contained. Because the split points follow meaning rather than a character count or a delimiter, the segmentation is more nuanced than fixed-size or simple delimiter-based methods, which can improve downstream tasks such as retrieval in RAG.
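The boundary-prediction idea can be illustrated with a minimal sketch. This is not Chonky's code or API: the real library runs a fine-tuned transformer, whereas `toy_predict` below is a hypothetical stand-in that flags a boundary with a hand-written cue-word rule, and `tokenize_with_offsets` and `chunks_from_labels` are illustrative names. The part that carries over is the last step: once each token has a "chunk ends here" label, the character offsets of the flagged tokens determine where the text is cut.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (token text, start offset, end offset)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split on whitespace and keep character offsets for each token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def toy_predict(tokens: List[Token]) -> List[int]:
    """Hypothetical stand-in for the neural model.

    Returns one label per token: 1 means "a chunk ends after this token".
    A real model would learn this decision; here a boundary is flagged when
    a sentence-final token is followed by a topic-shift cue word.
    """
    cues = {"Meanwhile,", "Separately,"}
    labels = []
    for i, (tok, _, _) in enumerate(tokens):
        next_tok = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        labels.append(1 if tok.endswith(".") and next_tok in cues else 0)
    return labels


def chunks_from_labels(text: str, tokens: List[Token], labels: List[int]) -> List[str]:
    """Cut the original text at the end offset of every flagged token."""
    chunks = []
    start = 0
    for (_, _, end), label in zip(tokens, labels):
        if label == 1:
            chunks.append(text[start:end].strip())
            start = end
    tail = text[start:].strip()
    if tail:  # whatever follows the last boundary forms the final chunk
        chunks.append(tail)
    return chunks


text = (
    "Cats sleep a lot. They nap in sunbeams. "
    "Meanwhile, the stock market fell. Traders were nervous."
)
tokens = tokenize_with_offsets(text)
labels = toy_predict(tokens)
chunks = chunks_from_labels(text, tokens, labels)
for chunk in chunks:
    print(chunk)
```

Slicing the original string by offsets, instead of re-joining tokens, keeps the chunk text byte-for-byte faithful to the input, which matters when chunks are later shown to users or cited by a RAG system.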
Quick Start & Requirements
pip install chonky
Highlighted Details
Model: mirth/chonky_modernbert_large_1 (396M parameters, 1024 sequence length).
MarkupRemover helper class to strip HTML, XML, and Markdown tags before chunking.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats