mirth/chonky
Python library for semantic text chunking, useful in RAG systems
Top 73.2% on SourcePulse
mirth/chonky provides a fully neural approach to text chunking, using fine-tuned transformer models to segment text into semantically meaningful units. The library is particularly useful for Retrieval-Augmented Generation (RAG) systems, where it offers an alternative to traditional rule-based or fixed-size chunking methods.
How It Works
Chonky utilizes transformer models, specifically variants of BERT and DistilBERT, to predict optimal text segmentation points. The core idea is to identify semantic boundaries within the text, aiming to create chunks that are coherent and contextually relevant. This neural approach allows for more nuanced segmentation compared to fixed-size or simple delimiter-based methods, potentially improving the performance of downstream NLP tasks like RAG.
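To make the idea of splitting at predicted boundaries concrete, here is a minimal sketch in plain Python. It does not call Chonky or any model: the whitespace tokenizer and the per-token labels (1 = "a semantic unit ends after this token") are hand-written stand-ins for what a fine-tuned transformer might emit, and the helper names `boundaries_from_labels` and `split_at_boundaries` are hypothetical, not part of the library.

```python
import re
from typing import List, Tuple


def boundaries_from_labels(token_spans: List[Tuple[int, int]], labels: List[int]) -> List[int]:
    """Return the end offset of every token labeled 1 (assumed label scheme)."""
    return [end for (_, end), label in zip(token_spans, labels) if label == 1]


def split_at_boundaries(text: str, boundary_offsets: List[int]) -> List[str]:
    """Cut `text` after each character offset and drop empty chunks."""
    chunks = []
    start = 0
    for end in sorted(set(boundary_offsets)):
        if start < end <= len(text):
            chunk = text[start:end].strip()
            if chunk:
                chunks.append(chunk)
            start = end
    tail = text[start:].strip()  # whatever follows the last boundary
    if tail:
        chunks.append(tail)
    return chunks


text = "Cats sleep a lot. They like warm spots. Python is a programming language."

# Stand-in tokenizer: whitespace-separated tokens with character spans.
token_spans = [m.span() for m in re.finditer(r"\S+", text)]

# Hand-written labels standing in for model predictions:
# the topic changes after the 8th token ("spots.").
labels = [0] * len(token_spans)
labels[7] = 1

chunks = split_at_boundaries(text, boundaries_from_labels(token_spans, labels))
for chunk in chunks:
    print(chunk)
    print("--")
```

In a real pipeline the labels would come from the model rather than being written by hand, and long inputs would need to be processed in windows that fit the model's sequence length.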
Quick Start & Requirements
pip install chonky
Highlighted Details
The mirth/chonky_modernbert_large_1 model has 396M parameters and a sequence length of 1024.
MarkupRemover helper class to strip HTML, XML, and Markdown tags before chunking.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats