Meta-Chunking by IAAR-Shanghai

LLM-powered text chunking for logical document segmentation

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Summary

Meta-Chunking addresses text segmentation and semantic completion for LLMs by learning efficient, logically coherent document partitioning. It targets researchers and developers seeking to improve RAG systems and content clarity. The core benefit is dynamic chunk granularity, enhancing retrieval relevance and idea integrity by ensuring each segment captures a complete thought.

How It Works

This approach leverages LLMs' logical perception to dynamically partition documents into independent chunks. It prioritizes variable chunk sizes, maintaining logical integrity by splitting text at points of high model certainty (low perplexity) and preserving continuity at points of uncertainty. Techniques like perplexity-based chunking, margin sampling, and a three-stage rewriting/two-stage summarization process repair semantic discontinuities, aiming for more coherent chunks for RAG systems, even with SLMs.
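As a rough illustration of the perplexity-based idea described above (a hypothetical sketch, not the repository's actual implementation), a chunker can score each candidate boundary and split where the model's perplexity is low, i.e. where the model is most certain:

```python
from typing import Callable, List

def ppl_chunk(sentences: List[str],
              ppl: Callable[[str, str], float],
              threshold: float = 20.0,
              max_chunk_sents: int = 8) -> List[List[str]]:
    """Split a sentence sequence into chunks at low-perplexity boundaries.

    `ppl(context, sentence)` is a hypothetical scorer returning the
    perplexity of `sentence` conditioned on `context` (e.g. computed
    with any causal LM). Per the approach's description, boundaries
    with low perplexity (high model certainty) become split points,
    while uncertain boundaries preserve continuity.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for sent in sentences:
        if current:
            context = " ".join(current)
            # Low perplexity -> high certainty -> safe place to split.
            if ppl(context, sent) < threshold or len(current) >= max_chunk_sents:
                chunks.append(current)
                current = []
        current.append(sent)
    if current:
        chunks.append(current)
    return chunks
```

In practice the scorer would wrap a language model (the project notes that KV caching keeps this efficient even for long documents); here it is left as a plain callable so the control flow is visible.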

Quick Start & Requirements

  • Installation: Requires Python 3.10 and a Conda environment (conda create -n MetaChunking python=3.10, conda activate MetaChunking). Install dependencies via pip install -r requirements.txt. The core package can also be installed standalone via pip install lmchunker.
  • Demo: Run python app.py for a Gradio interface.
  • Usage: Refer to tools/lmchunker_eval.ipynb and tools/lmchunker_usage.ipynb.
  • Benchmarks: Supports CRUD, LongBench, MultiHop-RAG, RAGBench. Datasets and evaluation methods are available via GitHub links.
  • Resources: Downloadable datasets (meta-chunking.zip, summary_rewrite.zip) and configuration instructions (Instructions.md) are provided.
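The setup steps above, collected as shell commands (as documented in the README; the demo assumes app.py sits at the repository root):

```shell
# Create and activate the Conda environment (Python 3.10)
conda create -n MetaChunking python=3.10
conda activate MetaChunking

# Install the repository's dependencies
pip install -r requirements.txt

# Or install just the core package
pip install lmchunker

# Launch the Gradio demo
python app.py
```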

Highlighted Details

  • PPL Chunking: Utilizes KV caching for efficient segmentation of short and long documents.
  • Margin Sampling Chunking: Employs binary classification on sentence pairs to determine segmentation points.
  • Dynamic Combination: A strategy to balance fine-grained and coarse-grained chunking requirements.
  • LumberChunker & Dense X Retrieval: Refactored into convenient interfaces, enabling integration with local small models and dense retrieval systems.
  • lmchunker Package: A dedicated Python package for LLM chunking, simplifying integration.
  • MoC Approach: Introduces a novel method for intelligent text processing.
  • SLM Performance: Demonstrates high-quality chunking feasibility on Small Language Models.
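As a minimal sketch of the margin-sampling idea (a hypothetical interface, not the lmchunker API), a binary classifier scores each adjacent sentence pair for "split" vs "keep together" and inserts a boundary when the probability margin favors splitting:

```python
from typing import Callable, List, Tuple

def margin_sample_chunk(sentences: List[str],
                        prob_split: Callable[[str, str], Tuple[float, float]],
                        margin: float = 0.0) -> List[List[str]]:
    """Chunk via binary classification on adjacent sentence pairs.

    `prob_split(a, b)` is a hypothetical scorer returning (p_yes, p_no):
    the model's probabilities that a chunk boundary should fall between
    sentences `a` and `b` (e.g. derived from an LLM's or SLM's logits
    for a yes/no prompt). A boundary is placed when p_yes - p_no
    exceeds `margin`.
    """
    if not sentences:
        return []
    chunks: List[List[str]] = []
    current: List[str] = [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        p_yes, p_no = prob_split(prev, sent)
        if p_yes - p_no > margin:  # margin favors a boundary here
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

Raising `margin` yields coarser chunks and lowering it yields finer ones, which is one simple way to expose the fine-grained/coarse-grained trade-off that the dynamic-combination strategy balances.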

Maintenance & Community

The project is under active development, with a stated intention to evolve into a plug-and-play library and regularly incorporate new chunking strategies. No specific community channels (e.g., Discord, Slack) or contributor details are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Users should verify licensing terms before adoption, especially for commercial or closed-source integration.

Limitations & Caveats

The project is presented as a research artifact with an accompanying library, indicating ongoing development. Some features are marked as "Todo" or are planned for future reconstruction. While demonstrating feasibility on SLMs, the library's rapid evolution may lead to API changes. Specific limitations regarding unsupported platforms or known bugs are not detailed.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 8 stars in the last 30 days
