zchunk by zeroentropy-ai

Novel LLM-powered chunking for RAG

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project addresses the significant challenge of effective document chunking for Retrieval-Augmented Generation (RAG) applications. It introduces zChunk, a novel strategy that leverages Llama 3.1 70B to automatically segment documents into semantically coherent chunks, aiming to improve retrieval accuracy and signal-to-noise ratios. zChunk offers a robust, out-of-the-box solution for RAG preprocessing, reducing the need for extensive manual tuning and custom regex development.

How It Works

zChunk employs a prompt-based approach in which Llama 3.1 70B is instructed to insert a special, non-corpus token (e.g., "段") at semantically meaningful boundaries within a document. This method bypasses the brittleness of regex-based splitting and the limitations of fixed-size or purely embedding-similarity-based chunking. For enhanced efficiency, zChunk uses low-level access to the LLM's log probabilities to identify optimal chunking points without generating full output tokens, significantly reducing inference latency. This optimization is crucial for processing large documents rapidly.
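The boundary-selection step described above can be sketched as follows. This is an illustrative sketch, not the repository's actual code: `boundary_logprobs` stands in for the per-position log probability that the model assigns to emitting the special boundary token next, which in practice would be read from the model's logits during a single scoring pass.

```python
import math

def chunk_by_logprobs(tokens, boundary_logprobs, threshold=math.log(0.5)):
    """Split a token sequence wherever the model assigns high probability
    to the special boundary token (e.g., "段") appearing next.

    tokens: the document's tokens (e.g., from tiktoken)
    boundary_logprobs: log P(boundary token | tokens[:i+1]) at each position i
    threshold: minimum log probability to treat a position as a chunk break
    """
    chunks, current = [], []
    for token, lp in zip(tokens, boundary_logprobs):
        current.append(token)
        if lp >= threshold:  # the model "wants" to emit the boundary marker here
            chunks.append(current)
            current = []
    if current:  # trailing tokens form the final chunk
        chunks.append(current)
    return chunks

# Toy example: positions 2 and 5 carry high boundary probability.
tokens = ["A", "B", "C", "D", "E", "F"]
lps = [-9.0, -8.5, -0.1, -7.0, -6.0, -0.2]
print(chunk_by_logprobs(tokens, lps))  # [['A', 'B', 'C'], ['D', 'E', 'F']]
```

Because only log probabilities at existing positions are inspected, no output tokens are generated, which is where the latency saving comes from.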

Quick Start & Requirements

  • Primary Install/Run: Not explicitly detailed. Requires local inference setup for Llama 3.1 70B.
  • Prerequisites: Llama 3.1 70B model, Python, tiktoken library. GPU acceleration is implied by benchmark performance (A100).
  • Links: No official quick-start, demo, or documentation links are provided in the README.
Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
3 stars in the last 30 days

Explore Similar Projects

Starred by Luis Capelo (Cofounder of Lightning AI), Carol Willing (Core Contributor to CPython, Jupyter), and 2 more.

chonkie by chonkie-inc

Chunking library for RAG applications

Top 0.5% · 4k stars
Created 11 months ago · Updated 1 day ago