SDK for text chunking and evaluation research
This package provides tools for evaluating text chunking strategies. It targets researchers and developers building AI applications that rely on retrieval-augmented generation (RAG). It offers implementations of novel chunking methods and a framework for comparing their performance against custom or existing strategies, so that chunking configurations can be chosen based on measured results.
How It Works
The library facilitates evaluation through a GeneralEvaluation class, which orchestrates the process of chunking documents, generating embeddings (via pluggable embedding functions), and assessing retrieval quality using metrics like Intersection over Union (IoU) and recall. It supports custom chunker implementations and novel strategies like ClusterSemanticChunker and LLMChunker, allowing for direct comparison and analysis of their effectiveness.
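To make the two metrics concrete, here is a minimal, self-contained sketch of how recall and IoU could be computed when chunks and reference excerpts are both expressed as character ranges. It does not use the library's API: the function names (fixed_size_spans, covered, recall, iou), the character-level granularity, and the toy fixed-size chunker are all assumptions made for illustration, and the library's own metric definitions may differ in detail.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a half-open character range [start, end).
Span = Tuple[int, int]


def fixed_size_spans(text: str, size: int) -> List[Span]:
    """Toy chunker: consecutive character windows of at most `size` characters."""
    return [(i, min(i + size, len(text))) for i in range(0, len(text), size)]


def covered(spans: List[Span]) -> Set[int]:
    """Set of character positions covered by the given spans."""
    return {i for start, end in spans for i in range(start, end)}


def recall(retrieved: List[Span], relevant: List[Span]) -> float:
    """Fraction of the relevant characters that appear in the retrieved chunks."""
    rel = covered(relevant)
    if not rel:
        return 0.0
    return len(rel & covered(retrieved)) / len(rel)


def iou(retrieved: List[Span], relevant: List[Span]) -> float:
    """Overlap between retrieved and relevant characters divided by their union."""
    ret, rel = covered(retrieved), covered(relevant)
    union = ret | rel
    if not union:
        return 0.0
    return len(ret & rel) / len(union)


if __name__ == "__main__":
    text = "a" * 100
    chunks = fixed_size_spans(text, 40)   # [(0, 40), (40, 80), (80, 100)]
    relevant = [(30, 50)]                 # the excerpt that answers a query

    # Retrieving only the first chunk captures half of the excerpt.
    print(recall([chunks[0]], relevant))  # 0.5
    print(iou([chunks[0]], relevant))     # 0.2

    # Retrieving the first two chunks captures all of it, at the cost of extra text.
    print(recall(chunks[:2], relevant))   # 1.0
    print(iou(chunks[:2], relevant))      # 0.25
```

The second case shows why both numbers are worth reporting: larger or more numerous chunks push recall up, while IoU penalizes the irrelevant text that is retrieved along with the excerpt.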
Quick Start & Requirements
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git
Dependencies: tiktoken, fuzzywuzzy, pandas, numpy, tqdm, chromadb, python-Levenshtein, openai, anthropic, attrs.
Highlighted Details
ClusterSemanticChunker and LLMChunker.
Maintenance & Community
dev branch.
Licensing & Compatibility
Limitations & Caveats