late-chunking by jina-ai

Research paper code for late chunking (chunked pooling) in embedding models

created 1 year ago
426 stars

Top 70.5% on sourcepulse

Project Summary

This repository provides the implementation for "Late Chunking" (chunked pooling), a technique that improves Retrieval-Augmented Generation (RAG) systems by preserving long-distance contextual dependencies in text. It is aimed at developers and researchers working with RAG and large language models who need more accurate retrieval over documents that span multiple text chunks.

How It Works

Late Chunking leverages the long context windows of modern embedding models (e.g., 8192 tokens) by processing large spans of text before chunking. Instead of splitting the text and embedding each chunk independently, it embeds the entire text (or as much of it as fits in the context window) to obtain token-level vector representations, then mean-pools the token vectors belonging to each smaller segment to produce chunk embeddings. Each chunk embedding therefore incorporates information from the whole document, which markedly improves retrieval of semantically related text, especially when a chunk contains anaphoric references to entities introduced in earlier chunks.
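As a concrete illustration, here is a minimal sketch of the idea using the Hugging Face transformers API. The example text, chunk boundaries, and variable names are illustrative, not taken from the repository's code:

```python
# Minimal late-chunking sketch (illustrative, not the repository's code):
# embed the whole text once, then mean-pool the token vectors of each chunk.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-small-en"  # model used by the evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "Berlin is the capital of Germany. The city has 3.8 million residents."
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0]  # per-token (char_start, char_end)

with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state[0]  # (seq_len, dim)

# Character spans of the two "chunks" (here: the two sentences).
boundary = text.index(". ") + 1
char_spans = [(0, boundary), (boundary, len(text))]

chunk_embeddings = []
for start, end in char_spans:
    # Keep tokens whose character offsets fall inside this chunk; the
    # (end > start) check drops special tokens, which map to (0, 0).
    mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
    chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
```

Because every token vector was produced with attention over the full text, the pooled embedding for the second sentence carries the information that "the city" refers to Berlin, which is exactly what chunking before embedding loses.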

Quick Start & Requirements

  • Install dependencies: pip install .
  • Run evaluation: python3 run_chunked_eval.py --task-name {TASK_NAME}, where tasks include "SciFactChunked", "TRECCOVIDChunked", etc. (see the example after this list)
  • Requires Python 3.x.
  • Evaluation uses the jina-embeddings-v2-small-en model.
  • Further details on evaluation tasks are available in the MTEB Repository.
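For example, to install and then evaluate on the SciFact task listed above:

```
pip install .
python3 run_chunked_eval.py --task-name SciFactChunked
```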

Highlighted Details

  • Demonstrates improved retrieval accuracy (nDCG@10) on BeIR benchmarks compared to traditional pre-chunking methods.
  • Shows significant gains in similarity scores for sentences with anaphoric references to entities mentioned in earlier chunks.
  • Performance improvements correlate with longer average document lengths.
  • Code is available to reproduce evaluation results.
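Continuing the sketch from "How It Works", the anaphora effect can be checked by scoring a query against the pooled chunk embeddings. The snippet below is illustrative; the scores are whatever the model produces, not results reproduced from the paper:

```python
# Score a query against the late-chunked embeddings from the sketch above.
# The point of late chunking: "The city has 3.8 million residents." stays
# similar to "Berlin" because its tokens attended to the full document.
import torch.nn.functional as F

query_enc = tokenizer("Berlin", return_tensors="pt")
with torch.no_grad():
    query_emb = model(**query_enc).last_hidden_state[0].mean(dim=0)

for (start, end), emb in zip(char_spans, chunk_embeddings):
    sim = F.cosine_similarity(query_emb, emb, dim=0).item()
    print(f"{sim:.3f}  {text[start:end]!r}")
```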

Maintenance & Community

  • Code contributed by Isabelle Mohr (@violenil).
  • README reviewed by Scott Martens (@scott-martens).
  • References the paper "Jina embeddings 2: 8192-token general-purpose text embeddings for long documents."
  • Citation provided for the "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" paper.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

As noted above, the README does not specify a license, which is a critical factor for adoption in commercial or closed-source environments. While late chunking generally improves retrieval, the "no chunking" approach sometimes scores better, particularly on datasets with shorter documents or when ranking individual chunks is not the primary goal.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 49 stars in the last 90 days

Explore Similar Projects

Starred by Philipp Schmid (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

WordLlama by dleemiller
NLP toolkit for leveraging LLM token embeddings
1k stars · created 1 year ago · updated 4 months ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (Founder of OpenBB), and 11 more.

sentence-transformers by UKPLab
Framework for text embeddings, retrieval, and reranking
17k stars · created 6 years ago · updated 3 days ago