late-chunking by jina-ai

Research paper code for late chunking (chunked pooling) in embedding models

Created 1 year ago
450 stars

Top 66.9% on SourcePulse

Project Summary

This repository provides the implementation for "Late Chunking," a technique designed to improve the performance of Retrieval Augmented Generation (RAG) systems by addressing the challenge of long-distance contextual dependencies in text. It is targeted at developers and researchers working with RAG and large language models who need to enhance information retrieval accuracy for documents that span multiple text chunks.

How It Works

Late Chunking leverages the extended context windows of modern embedding models (e.g., 8192 tokens) by first processing larger segments of text. Instead of chunking text before embedding, it embeds the entire text (or a large portion) to generate token-level vector representations. Then, it applies mean pooling to smaller segments of these token vectors to create chunk embeddings. This approach allows embeddings for smaller chunks to incorporate information from the entire document, significantly improving the retrieval of semantically related text, especially when anaphoric references are present across chunks.
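The pooling step described above reduces to a few lines. The sketch below is illustrative only: `late_chunk` is a hypothetical helper, not part of the repository's API, and random vectors stand in for real model output. In practice the token embeddings would come from a single forward pass of a long-context embedding model over the whole document, and the spans from the tokenizer's offset mapping.

```python
import numpy as np

def late_chunk(token_embeddings, spans):
    """Mean-pool token-level vectors into one embedding per chunk.

    token_embeddings: array of shape (num_tokens, dim) from ONE forward
    pass over the entire document, so each token vector already carries
    document-wide context.
    spans: list of (start, end) token indices, one pair per chunk.
    """
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]

# Toy stand-in for real model output: 12 "token" vectors of dimension 4.
rng = np.random.default_rng(seed=0)
token_embeddings = rng.normal(size=(12, 4))

# Two chunks covering tokens [0, 6) and [6, 12).
chunks = late_chunk(token_embeddings, [(0, 6), (6, 12)])
```

By contrast, traditional (early) chunking runs the model separately on each chunk's text, so no token vector can attend to content outside its own chunk; late chunking defers the split until after the tokens have been contextualized.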

Quick Start & Requirements

  • Install dependencies: pip install .
  • Run evaluation: python3 run_chunked_eval.py --task-name {TASK_NAME} (tasks include "SciFactChunked", "TRECCOVIDChunked", etc.)
  • Requires Python 3.x.
  • Evaluation uses the jina-embeddings-v2-small-en model.
  • Further details on evaluation tasks are available in the MTEB Repository.

Highlighted Details

  • Demonstrates improved retrieval accuracy (nDCG@10) on BeIR benchmarks compared to traditional pre-chunking methods.
  • Shows significant gains in similarity scores for sentences with anaphoric references to entities mentioned in earlier chunks.
  • Performance improvements correlate with longer average document lengths.
  • Code is available to reproduce evaluation results.

Maintenance & Community

  • Code contributed by Isabelle Mohr (@violenil).
  • README reviewed by Scott Martens (@scott-martens).
  • References the paper "Jina embeddings 2: 8192-token general-purpose text embeddings for long documents."
  • Citation provided for the "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" paper.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The README does not specify the license, which is a critical factor for adoption, especially in commercial or closed-source environments. While late chunking generally improves retrieval, the "no chunking" approach sometimes yields better results, particularly for datasets with shorter documents or when ranking individual chunks is not the primary goal.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 30 days
