chonky by mirth

Python library for semantic text chunking, useful in RAG systems

created 3 months ago
367 stars

Top 78.0% on sourcepulse

Project Summary

mirth/chonky provides a fully neural approach to text chunking, leveraging fine-tuned transformer models to segment text into semantically meaningful units. This library is particularly useful for Retrieval Augmented Generation (RAG) systems, offering an intelligent alternative to traditional rule-based or fixed-size chunking methods.

How It Works

Chonky utilizes transformer models, specifically variants of BERT and DistilBERT, to predict optimal text segmentation points. The core idea is to identify semantic boundaries within the text, aiming to create chunks that are coherent and contextually relevant. This neural approach allows for more nuanced segmentation compared to fixed-size or simple delimiter-based methods, potentially improving the performance of downstream NLP tasks like RAG.
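
To make the mechanism concrete, the sketch below shows one way such a boundary-predicting checkpoint could be queried directly through the Hugging Face transformers pipeline. It is an illustration under stated assumptions, not the library's own API: it assumes the published checkpoints (the model id is taken from the Highlighted Details section below) behave as token-classification models whose predicted spans mark chunk boundaries.

    from transformers import pipeline

    # Assumption: the chonky checkpoints are token-classification models whose
    # predicted spans indicate where one chunk should end and the next begin.
    boundary_tagger = pipeline(
        "token-classification",
        model="mirth/chonky_modernbert_large_1",  # model id mentioned in Highlighted Details
        aggregation_strategy="simple",
    )

    text = (
        "The first topic is introduced and developed over two sentences. "
        "It continues with supporting detail. A completely new topic starts here."
    )

    # Each prediction carries character offsets; cutting at those offsets yields
    # semantically motivated chunks rather than fixed-size windows.
    cuts = sorted(p["end"] for p in boundary_tagger(text))
    chunks, start = [], 0
    for end in cuts + [len(text)]:
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    print(chunks)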

Quick Start & Requirements

  • Install via pip: pip install chonky (see the usage sketch after this list).
  • Requires Python; the selected transformer model is downloaded automatically on first run.
  • Supports CPU and GPU.
  • Official documentation and demo are not explicitly linked in the README.
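
Putting the quick-start steps together, a minimal usage sketch might look like the following. The ParagraphSplitter class name and its model_id/device parameters reflect one reading of the upstream README and are assumptions here; verify them against the repository before use.

    # Minimal sketch; the class name and constructor parameters are assumptions,
    # check the chonky README for the exact API.
    from chonky import ParagraphSplitter

    # The transformer weights are downloaded from the Hugging Face Hub on first run.
    splitter = ParagraphSplitter(
        model_id="mirth/chonky_modernbert_large_1",  # model id from Highlighted Details
        device="cpu",                                # or "cuda" when a GPU is available
    )

    with open("document.txt", encoding="utf-8") as fh:
        text = fh.read()

    # The splitter yields semantically coherent chunks suitable for RAG indexing.
    for chunk in splitter(text):
        print(chunk)
        print("--")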

Highlighted Details

  • Offers multiple transformer models with varying parameter counts and sequence lengths (e.g., mirth/chonky_modernbert_large_1 with 396M parameters and 1024 sequence length).
  • Includes a MarkupRemover helper class to strip HTML, XML, and Markdown tags before chunking (a hedged usage sketch follows this list).
  • Benchmarks show competitive F1 scores on the bookcorpus and paul_graham datasets, outperforming several other chunking methods, including some from LangChain and LlamaIndex.
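
The MarkupRemover helper mentioned above could be combined with the splitter roughly as sketched below. Its constructor and call style are assumptions made for illustration only; consult the project's README for the real interface.

    # Illustration only: the MarkupRemover call style below is an assumption.
    from chonky import MarkupRemover, ParagraphSplitter

    remover = MarkupRemover()
    splitter = ParagraphSplitter(device="cpu")

    html_doc = "<h1>Title</h1><p>Some <b>markup-heavy</b> text to be chunked.</p>"

    plain_text = remover(html_doc)      # strip HTML/XML/Markdown tags first
    for chunk in splitter(plain_text):  # then chunk the cleaned text
        print(chunk)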

Maintenance & Community

  • The project is maintained by "mirth".
  • No specific community channels (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • Benchmarks are character-based and computed on a limited dataset size (1M characters), which may not fully represent real-world performance.
  • The README notes that the bookcorpus dataset used for benchmarking is also the Chonky validation set, potentially introducing bias.
  • No explicit mention of support for languages other than English.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days
