chunking_evaluation  by brandonstarxel

SDK for text chunking and evaluation research

created 1 year ago
365 stars

Top 78.2% on sourcepulse

Project Summary

This package provides tools for evaluating text chunking strategies, targeting researchers and developers building AI applications that rely on retrieval augmented generation (RAG). It offers implementations of novel chunking methods and a framework for comparing their performance against custom or existing strategies, enabling data-driven selection of optimal chunking configurations.

How It Works

The library facilitates evaluation through a GeneralEvaluation class, which orchestrates the process of chunking documents, generating embeddings (via pluggable embedding functions), and assessing retrieval quality using metrics like Intersection over Union (IoU) and recall. It supports custom chunker implementations and novel strategies like ClusterSemanticChunker and LLMChunker, allowing for direct comparison and analysis of their effectiveness.
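To make the two metrics concrete, here is a minimal sketch of how recall and IoU could be computed for a single query. It assumes that relevant excerpts and retrieved chunks are both represented as half-open character ranges (start, end) within the same document. The function name recall_and_iou and this range representation are illustrative assumptions, not the library's API, and the package's own implementation may differ in detail.

```python
def recall_and_iou(relevant_spans, retrieved_spans):
    """Return (recall, IoU) for one query.

    Both arguments are lists of half-open (start, end) character ranges.
    This is an illustrative helper, not part of chunking_evaluation.
    """
    # Expand each list of ranges into the set of character positions it covers.
    relevant = set()
    for start, end in relevant_spans:
        relevant.update(range(start, end))

    retrieved = set()
    for start, end in retrieved_spans:
        retrieved.update(range(start, end))

    overlap = relevant & retrieved
    union = relevant | retrieved

    # Recall: share of the relevant text that was actually retrieved.
    recall = len(overlap) / len(relevant) if relevant else 0.0
    # IoU: overlap divided by everything that is relevant or retrieved,
    # so retrieving large amounts of irrelevant text lowers the score.
    iou = len(overlap) / len(union) if union else 0.0
    return recall, iou


# One relevant excerpt of 20 characters; two retrieved chunks,
# only the first of which overlaps the excerpt (by 10 characters).
recall, iou = recall_and_iou([(10, 30)], [(0, 20), (40, 50)])
print(recall, iou)
```

In this example half of the relevant excerpt is retrieved, so recall is 0.5, while IoU is lower because the retrieved chunks also contain text outside the excerpt.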

Quick Start & Requirements

  • Installation: pip install git+https://github.com/brandonstarxel/chunking_evaluation.git
  • Prerequisites: OpenAI API key for embedding functions.
  • Demo: Google Colab
  • Dependencies: tiktoken, fuzzywuzzy, pandas, numpy, tqdm, chromadb, python-Levenshtein, openai, anthropic, attrs.

Highlighted Details

  • Includes novel chunking methods: ClusterSemanticChunker and LLMChunker.
  • Provides a synthetic dataset pipeline for domain-specific evaluation.
  • Supports custom chunker and embedding function integration.
  • Evaluates chunking performance using metrics like IoU and recall.
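
As an illustration of what a custom chunker might look like, the following sketch implements a fixed-size character chunker with optional overlap. The class name FixedSizeChunker and the split_text method name are assumptions made for this example. The summary above does not specify the interface a custom chunker must implement, so check the repository before plugging a class like this into the evaluation framework.

```python
class FixedSizeChunker:
    """Hypothetical chunker: splits text into fixed-size character windows."""

    def __init__(self, chunk_size=1200, overlap=0):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def split_text(self, text):
        # Advance by (chunk_size - overlap) so consecutive chunks share
        # `overlap` characters; the last chunk may be shorter than chunk_size.
        step = self.chunk_size - self.overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


chunker = FixedSizeChunker(chunk_size=4, overlap=1)
print(chunker.split_text("abcdefghij"))
```

A simple baseline like this is useful mainly as a reference point when comparing more elaborate strategies under the same metrics.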

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Requires an OpenAI API key for core functionality, potentially incurring costs.
  • The lack of an explicit license may pose compatibility issues for commercial or closed-source use.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 3
  • Issues (30d): 0

Star History

  • 65 stars in the last 90 days
