chonky by mirth

Python library for semantic text chunking, useful in RAG systems

created 3 months ago
367 stars

Top 78.0% on sourcepulse

Project Summary

mirth/chonky provides a fully neural approach to text chunking, leveraging fine-tuned transformer models to segment text into semantically meaningful units. This library is particularly useful for Retrieval Augmented Generation (RAG) systems, offering an intelligent alternative to traditional rule-based or fixed-size chunking methods.

How It Works

Chonky utilizes transformer models, specifically variants of BERT and DistilBERT, to predict optimal text segmentation points. The core idea is to identify semantic boundaries within the text, aiming to create chunks that are coherent and contextually relevant. This neural approach allows for more nuanced segmentation compared to fixed-size or simple delimiter-based methods, potentially improving the performance of downstream NLP tasks like RAG.
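
To make the mechanism concrete, the sketch below shows one way such a boundary-predicting checkpoint could be queried directly through the Hugging Face transformers pipeline. It is an illustration under stated assumptions, not the library's own API: it assumes the published checkpoints (the model id is taken from the Highlighted Details section below) behave as token-classification models whose predicted spans mark chunk boundaries.

    from transformers import pipeline

    # Assumption: the chonky checkpoints are token-classification models whose
    # predicted spans indicate where one chunk should end and the next begin.
    boundary_tagger = pipeline(
        "token-classification",
        model="mirth/chonky_modernbert_large_1",  # model id mentioned in Highlighted Details
        aggregation_strategy="simple",
    )

    text = (
        "The first topic is introduced and developed over two sentences. "
        "It continues with supporting detail. A completely new topic starts here."
    )

    # Each prediction carries character offsets; cutting at those offsets yields
    # semantically motivated chunks rather than fixed-size windows.
    cuts = sorted(p["end"] for p in boundary_tagger(text))
    chunks, start = [], 0
    for end in cuts + [len(text)]:
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    print(chunks)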

Quick Start & Requirements

  • Install via pip: pip install chonky (see the usage sketch after this list).
  • Requires Python; the selected transformer model is downloaded automatically on first run.
  • Supports CPU and GPU.
  • Official documentation and demo are not explicitly linked in the README.
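
Putting the quick-start steps together, a minimal usage sketch might look like the following. The ParagraphSplitter class name and its model_id/device parameters reflect one reading of the upstream README and are assumptions here; verify them against the repository before use.

    # Minimal sketch; the class name and constructor parameters are assumptions,
    # check the chonky README for the exact API.
    from chonky import ParagraphSplitter

    # The transformer weights are downloaded from the Hugging Face Hub on first run.
    splitter = ParagraphSplitter(
        model_id="mirth/chonky_modernbert_large_1",  # model id from Highlighted Details
        device="cpu",                                # or "cuda" when a GPU is available
    )

    with open("document.txt", encoding="utf-8") as fh:
        text = fh.read()

    # The splitter yields semantically coherent chunks suitable for RAG indexing.
    for chunk in splitter(text):
        print(chunk)
        print("--")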

Highlighted Details

  • Offers multiple transformer models with varying parameter counts and sequence lengths (e.g., mirth/chonky_modernbert_large_1 with 396M parameters and 1024 sequence length).
  • Includes a MarkupRemover helper class to strip HTML, XML, and Markdown tags before chunking (a hedged usage sketch follows this list).
  • Benchmarks show competitive F1 scores on the bookcorpus and paul_graham datasets, outperforming several other chunking methods, including some from LangChain and LlamaIndex.
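
The MarkupRemover helper mentioned above could be combined with the splitter roughly as sketched below. Its constructor and call style are assumptions made for illustration only; consult the project's README for the real interface.

    # Illustration only: the MarkupRemover call style below is an assumption.
    from chonky import MarkupRemover, ParagraphSplitter

    remover = MarkupRemover()
    splitter = ParagraphSplitter(device="cpu")

    html_doc = "<h1>Title</h1><p>Some <b>markup-heavy</b> text to be chunked.</p>"

    plain_text = remover(html_doc)      # strip HTML/XML/Markdown tags first
    for chunk in splitter(plain_text):  # then chunk the cleaned text
        print(chunk)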

Maintenance & Community

  • The project is maintained by "mirth".
  • No specific community channels (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Benchmarks are character-based and computed on a limited dataset size (1M characters), which may not fully represent real-world performance.
  • The README notes that the bookcorpus dataset used for benchmarking is also the Chonky validation set, potentially introducing bias.
  • No explicit mention of support for languages other than English.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days
