chunking_evaluation  by brandonstarxel

SDK for text chunking and evaluation research

Created 1 year ago
464 stars

Top 65.3% on SourcePulse

Project Summary

This package provides tools for evaluating text chunking strategies. It targets researchers and developers building AI applications that rely on retrieval-augmented generation (RAG). It includes implementations of novel chunking methods and a framework for comparing them against custom or existing strategies, so that a chunking configuration can be chosen from measured results.

How It Works

Evaluation runs through a GeneralEvaluation class. It chunks the documents, embeds the chunks with a pluggable embedding function, and scores retrieval quality with metrics such as Intersection over Union (IoU) and recall. Custom chunkers can be evaluated alongside the package's own novel strategies, ClusterSemanticChunker and LLMChunker, so their results can be compared directly.
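To make the two metrics concrete, here is a minimal sketch of how recall and IoU could be computed between the text a query should retrieve and the text that was actually retrieved. It is not the package's implementation: it assumes both sides are given as half-open character spans, and the names Span, covered_positions, recall, and iou are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumption: a span is a half-open character range [start, end) in the source document.
Span = Tuple[int, int]


def covered_positions(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    positions: Set[int] = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions


def recall(relevant: Iterable[Span], retrieved: Iterable[Span]) -> float:
    """Fraction of the relevant text that appears in the retrieved text."""
    rel = covered_positions(relevant)
    ret = covered_positions(retrieved)
    return len(rel & ret) / len(rel) if rel else 0.0


def iou(relevant: Iterable[Span], retrieved: Iterable[Span]) -> float:
    """Overlap divided by the combined extent of relevant and retrieved text."""
    rel = covered_positions(relevant)
    ret = covered_positions(retrieved)
    union = rel | ret
    return len(rel & ret) / len(union) if union else 0.0


# Hypothetical example: the relevant excerpt covers characters 100-199,
# while the retrieved chunk covers characters 150-299.
relevant_spans = [(100, 200)]
retrieved_spans = [(150, 300)]
recall_score = recall(relevant_spans, retrieved_spans)
iou_score = iou(relevant_spans, retrieved_spans)
```

Recall rewards retrieving all of the relevant text, while IoU also penalizes retrieving surrounding text that is not relevant, which is why the two are usually read together when comparing chunk sizes.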

Quick Start & Requirements

  • Installation: pip install git+https://github.com/brandonstarxel/chunking_evaluation.git
  • Prerequisites: OpenAI API key for embedding functions.
  • Demo: Google Colab
  • Dependencies: tiktoken, fuzzywuzzy, pandas, numpy, tqdm, chromadb, python-Levenshtein, openai, anthropic, attrs.

Highlighted Details

  • Includes novel chunking methods: ClusterSemanticChunker and LLMChunker.
  • Provides a synthetic dataset pipeline for domain-specific evaluation.
  • Supports custom chunker and embedding function integration.
  • Evaluates chunking performance using metrics like IoU and recall.
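
As an illustration of what a custom chunker might look like, the sketch below splits text into fixed-size character windows with optional overlap. It assumes only that a chunker maps a document string to a list of chunk strings; the class name FixedSizeChunker and the method name split_text are hypothetical, and the package's actual interface should be checked in its README.

```python
from typing import List


class FixedSizeChunker:
    """Hypothetical chunker: fixed-size character windows with optional overlap."""

    def __init__(self, chunk_size: int = 200, overlap: int = 0) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def split_text(self, text: str) -> List[str]:
        # Each window starts (chunk_size - overlap) characters after the previous one.
        step = self.chunk_size - self.overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


# Usage: split a short string into 10-character chunks that overlap by 2 characters.
chunker = FixedSizeChunker(chunk_size=10, overlap=2)
chunks = chunker.split_text("abcdefghijklmnopqrstuvwxyz")
```

A baseline this simple is useful mainly as a reference point: a more elaborate strategy only earns its extra cost if it scores better on the same metrics.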

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Requires an OpenAI API key for core functionality, potentially incurring costs.
  • The lack of an explicit license may pose compatibility issues for commercial or closed-source use.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 9 stars in the last 30 days
