chunking_evaluation by brandonstarxel

SDK for text chunking and evaluation research

Created 1 year ago

464 stars

Top 65.3% on SourcePulse

2 Experts Love This Project

simonw

Coauthor of Django

atroyn

Anton Troynikov

Cofounder of Chroma

Project Summary

This package provides tools for evaluating text chunking strategies, targeting researchers and developers building AI applications that rely on retrieval augmented generation (RAG). It offers implementations of novel chunking methods and a framework for comparing their performance against custom or existing strategies, enabling data-driven selection of optimal chunking configurations.

How It Works

The library centers on a GeneralEvaluation class that chunks documents, embeds the chunks with a pluggable embedding function, and scores retrieval quality with metrics such as Intersection over Union (IoU) and recall. Custom chunkers can be evaluated alongside the bundled ClusterSemanticChunker and LLMChunker strategies for direct comparison.
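To make the two metrics concrete, here is a minimal sketch that computes recall and IoU over character ranges. It is a simplified illustration, not the library's implementation: the function names `to_positions` and `recall_and_iou` are hypothetical, and the package itself may score at the token level rather than the character level.

```python
def to_positions(ranges):
    """Expand half-open (start, end) ranges into a set of character positions."""
    positions = set()
    for start, end in ranges:
        positions.update(range(start, end))
    return positions


def recall_and_iou(relevant_ranges, retrieved_ranges):
    """Return (recall, IoU) between relevant excerpts and retrieved chunks.

    recall = |relevant AND retrieved| / |relevant|
    IoU    = |relevant AND retrieved| / |relevant OR retrieved|
    """
    relevant = to_positions(relevant_ranges)
    retrieved = to_positions(retrieved_ranges)
    overlap = relevant & retrieved
    union = relevant | retrieved
    recall = len(overlap) / len(relevant) if relevant else 0.0
    iou = len(overlap) / len(union) if union else 0.0
    return recall, iou


# Hypothetical example: one relevant excerpt, two retrieved chunks.
relevant = [(100, 200)]
retrieved = [(50, 150), (180, 300)]
recall, iou = recall_and_iou(relevant, retrieved)
print(f"recall={recall:.2f}, iou={iou:.2f}")
```

Recall rewards covering the relevant excerpt, while IoU also penalizes retrieved text that falls outside it, which is why larger chunks tend to raise recall and lower IoU.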

Quick Start & Requirements

Installation: pip install git+https://github.com/brandonstarxel/chunking_evaluation.git
Prerequisites: OpenAI API key for embedding functions.
Demo: Google Colab
Dependencies: tiktoken, fuzzywuzzy, pandas, numpy, tqdm, chromadb, python-Levenshtein, openai, anthropic, attrs.

Highlighted Details

Includes novel chunking methods: ClusterSemanticChunker and LLMChunker.
Provides a synthetic dataset pipeline for domain-specific evaluation.
Supports custom chunker and embedding function integration.
Evaluates chunking performance using metrics like IoU and recall.
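As an illustration of what a custom chunker might look like, the sketch below defines a fixed-size chunker with optional overlap. It assumes a custom chunker is simply a class exposing a `split_text(text)` method that returns a list of strings; the class name `FixedSizeChunker` and its parameters are hypothetical, and the repository should be consulted for the actual base class and evaluation entry point.

```python
class FixedSizeChunker:
    """Hypothetical custom chunker: fixed-size character windows with overlap."""

    def __init__(self, chunk_size=200, overlap=0):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def split_text(self, text):
        """Split text into windows of chunk_size characters.

        Consecutive windows start (chunk_size - overlap) characters apart,
        so each window shares `overlap` characters with the previous one.
        """
        step = self.chunk_size - self.overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


chunker = FixedSizeChunker(chunk_size=10, overlap=2)
chunks = chunker.split_text("abcdefghijklmnopqrstuvwxyz")
for chunk in chunks:
    print(repr(chunk))
```

A chunker like this serves as a simple baseline to compare against semantic strategies under the same embedding function and metrics.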

Maintenance & Community

Contributions are welcomed via pull requests to the dev branch.
Research is detailed in the Chroma Technical Report: Evaluating Chunking Strategies for Retrieval.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

Requires an OpenAI API key for the embedding functions, so evaluation runs may incur API costs.
The lack of an explicit license may pose compatibility issues for commercial or closed-source use.

Health Check

Last Commit: 4 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 1

Star History

9 stars in the last 30 days

Explore Similar Projects

Meta-Chunking by IAAR-Shanghai

LLM-powered text chunking for logical document segmentation

Created 1 year ago

Updated 3 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 1 more.

text-splitter by benbrandt

Rust crate for splitting text into semantic chunks

Created 2 years ago

Updated 2 days ago

advanced-chunker by rango-ramesh

Semantic chunker for retrieval-augmented generation (RAG) pipelines

Created 9 months ago

Updated 9 months ago

Starred by Luca Soldaini (Research Scientist at Ai2).

semchunk by isaacus-dev

Python library for splitting text into semantically meaningful chunks

Created 2 years ago

Updated 2 months ago

embedding_rerank_retrieval by percent4

RAG evaluation for retrieval algorithms, using LlamaIndex

Created 2 years ago

Updated 5 months ago

vectordb by kagisearch

Python package for local, embeddings-based text retrieval

Created 2 years ago

Updated 1 year ago

late-chunking by jina-ai

Research paper code for late chunking (chunked pooling) in embedding models

Created 1 year ago

Updated 1 year ago

Starred by Andreas Jansson (Cofounder of Replicate).

dsRAG by D-Star-AI

RAG engine for unstructured data, excelling on dense text QA

Created 1 year ago

Updated 2 months ago

Starred by Luis Capelo (Cofounder of Lightning AI), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more.

chonkie by chonkie-inc

Chunking library for RAG applications

Created 9 months ago

Updated 2 days ago

Starred by Clarence Chio (Cofounder of Coverbase, Unit21) and Jasper Zhang (Cofounder of Hyperbolic).

pdfGPT by bhaskatripathi

PDF chatbot for interacting with PDF content

Created 2 years ago

Updated 10 months ago

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Yiran Wu (Coauthor of AutoGen), and 2 more.

RAG_Techniques by NirDiamant

RAG techniques showcase for enhanced generation systems

Created 1 year ago

Updated 1 month ago

Starred by Tobi Lutke (Cofounder of Shopify), Rodrigo Nader (Cofounder of Langflow), and 9 more.

ragflow by infiniflow

Open-source RAG engine for deep document understanding

Created 2 years ago

Updated 1 day ago
