late-chunking by jina-ai

Research paper code for late chunking (chunked pooling) in embedding models

created 1 year ago
426 stars

Top 70.5% on sourcepulse

Project Summary

This repository provides the implementation for "Late Chunking" (chunked pooling), a technique that improves Retrieval-Augmented Generation (RAG) systems by preserving long-distance contextual dependencies in text. It is aimed at developers and researchers working with RAG and large language models who need more accurate retrieval over documents that span multiple text chunks.

How It Works

Late Chunking leverages the long context windows of modern embedding models (e.g., 8192 tokens) by processing large spans of text before chunking. Instead of splitting the text and embedding each chunk independently, it embeds the entire text (or as much of it as fits in the context window) to obtain token-level vector representations, then mean-pools the token vectors belonging to each smaller segment to produce chunk embeddings. Each chunk embedding therefore incorporates information from the whole document, which markedly improves retrieval of semantically related text, especially when a chunk contains anaphoric references to entities introduced in earlier chunks.
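As a concrete illustration, here is a minimal sketch of the idea using the Hugging Face transformers API. The example text, chunk boundaries, and variable names are illustrative, not taken from the repository's code:

```python
# Minimal late-chunking sketch (illustrative, not the repository's code):
# embed the whole text once, then mean-pool the token vectors of each chunk.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-small-en"  # model used by the evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "Berlin is the capital of Germany. The city has 3.8 million residents."
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0]  # per-token (char_start, char_end)

with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state[0]  # (seq_len, dim)

# Character spans of the two "chunks" (here: the two sentences).
boundary = text.index(". ") + 1
char_spans = [(0, boundary), (boundary, len(text))]

chunk_embeddings = []
for start, end in char_spans:
    # Keep tokens whose character offsets fall inside this chunk; the
    # (end > start) check drops special tokens, which map to (0, 0).
    mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
    chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
```

Because every token vector was produced with attention over the full text, the pooled embedding for the second sentence carries the information that "the city" refers to Berlin, which is exactly what chunking before embedding loses.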

Quick Start & Requirements

  • Install dependencies: pip install .
  • Run evaluation: python3 run_chunked_eval.py --task-name {TASK_NAME}, where tasks include "SciFactChunked", "TRECCOVIDChunked", etc. (see the example after this list)
  • Requires Python 3.x.
  • Evaluation uses the jina-embeddings-v2-small-en model.
  • Further details on evaluation tasks are available in the MTEB Repository.
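For example, to install and then evaluate on the SciFact task listed above:

```
pip install .
python3 run_chunked_eval.py --task-name SciFactChunked
```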

Highlighted Details

  • Demonstrates improved retrieval accuracy (nDCG@10) on BeIR benchmarks compared to traditional pre-chunking methods.
  • Shows significant gains in similarity scores for sentences with anaphoric references to entities mentioned in earlier chunks.
  • Performance improvements correlate with longer average document lengths.
  • Code is available to reproduce evaluation results.
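Continuing the sketch from "How It Works", the anaphora effect can be checked by scoring a query against the pooled chunk embeddings. The snippet below is illustrative; the scores are whatever the model produces, not results reproduced from the paper:

```python
# Score a query against the late-chunked embeddings from the sketch above.
# The point of late chunking: "The city has 3.8 million residents." stays
# similar to "Berlin" because its tokens attended to the full document.
import torch.nn.functional as F

query_enc = tokenizer("Berlin", return_tensors="pt")
with torch.no_grad():
    query_emb = model(**query_enc).last_hidden_state[0].mean(dim=0)

for (start, end), emb in zip(char_spans, chunk_embeddings):
    sim = F.cosine_similarity(query_emb, emb, dim=0).item()
    print(f"{sim:.3f}  {text[start:end]!r}")
```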

Maintenance & Community

  • Code contributed by Isabelle Mohr (@violenil).
  • README reviewed by Scott Martens (@scott-martens).
  • References the paper "Jina embeddings 2: 8192-token general-purpose text embeddings for long documents."
  • Citation provided for the "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" paper.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

As noted above, the README does not specify a license, which is a critical factor for adoption in commercial or closed-source environments. While late chunking generally improves retrieval, the "no chunking" approach sometimes scores better, particularly on datasets with shorter documents or when ranking individual chunks is not the primary goal.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 49 stars in the last 90 days

Explore Similar Projects

Starred by Philipp Schmid (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

WordLlama by dleemiller
NLP toolkit for leveraging LLM token embeddings
1k stars · created 1 year ago · updated 4 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (Founder of OpenBB), and 11 more.

sentence-transformers by UKPLab
Framework for text embeddings, retrieval, and reranking
17k stars · created 6 years ago · updated 3 days ago