advanced-chunker by rango-ramesh

Semantic chunker for retrieval-augmented generation (RAG) pipelines

Created 9 months ago

290 stars

Top 91.0% on SourcePulse

Project Summary

This package provides semantically aware text chunking and clustering for LLM pipelines and RAG systems. Instead of splitting text at fixed sizes, it groups related ideas using sentence embeddings and clustering, which yields more coherent chunks and is intended to improve downstream retrieval and model performance. It targets developers and researchers working with LLMs, RAG, and knowledge processing.

How It Works

The package first converts text chunks into embeddings using Sentence Transformers, then computes cosine similarity between those embeddings to identify semantically related chunks. It applies agglomerative clustering with a distance threshold and merges the clustered chunks in a token-aware way, using real model tokenizers so that merged chunks stay within token limits. The result is fewer, denser chunks that preserve contextual coherence, which is meant to make RAG more efficient and easier to interpret.
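
The similarity and clustering steps described above can be illustrated with a minimal, dependency-free sketch. It is not the package's own code or API: the function names cosine_distance and cluster_by_threshold are hypothetical, the embeddings are assumed to be precomputed vectors (for example from a Sentence Transformers model), and average linkage is an assumption, since the summary does not say which linkage the package uses.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def cluster_by_threshold(
    embeddings: Sequence[Sequence[float]], distance_threshold: float
) -> List[List[int]]:
    """Hypothetical helper: average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than distance_threshold. Returns clusters as sorted lists
    of indices into embeddings.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average pairwise cosine distance between members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > distance_threshold:
            break  # no remaining pair is close enough to merge
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)


# Toy 2-D "embeddings": two vectors near the x-axis, two near the y-axis.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = cluster_by_threshold(toy_embeddings, distance_threshold=0.3)
```

With these toy vectors the two near-parallel pairs end up in separate clusters. In practice, scikit-learn's AgglomerativeClustering would likely be a more efficient choice than this quadratic loop; the pure-Python version is only meant to make the thresholded merging visible.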

Quick Start & Requirements

Install via pip: pip install advanced-chunker
Requires Python.
Official documentation and examples are available in the README.

Highlighted Details

Integrates with LangChain via SemanticChunkerSplitter.
Offers a CLI for scripting and automation, with export to JSON, Markdown, and CSV.
Provides visualization tools for attention heatmaps, semantic graphs, and cluster previews.
Supports token-aware merging with real model tokenizers.
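
Token-aware merging, as listed above, can be sketched as a greedy pass that joins adjacent chunks while a token budget allows. This is an illustrative assumption about how such merging might work, not the package's implementation: merge_within_token_limit and whitespace_token_count are hypothetical names, and the whitespace counter merely stands in for a real model tokenizer.

```python
from typing import Callable, List, Sequence


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


def merge_within_token_limit(
    chunks: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Hypothetical helper: greedily merge adjacent chunks under a token budget.

    A chunk that exceeds max_tokens on its own is passed through unchanged
    rather than split.
    """
    merged: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + " " + chunk
        if current and count_tokens(candidate) > max_tokens:
            # Adding this chunk would exceed the budget: close the current
            # merged chunk and start a new one.
            merged.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


merged_chunks = merge_within_token_limit(
    ["a b c", "d e", "f g h i", "j"], max_tokens=5
)
```

To count tokens the way a specific model does, a callable such as `lambda t: len(tokenizer.encode(t))` built on a Hugging Face tokenizer could be passed as count_tokens; note that encode may include special tokens in the count, depending on the tokenizer's settings.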

Maintenance & Community

Open to pull requests; issues should be opened for feature requests or bug fixes.
Mentions integrations with LangChain, Sentence Transformers, scikit-learn, and Hugging Face.

Licensing & Compatibility

MIT License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The package focuses on semantic chunking and clustering; advanced text preprocessing and other NLP tasks are out of scope. Chunk quality depends on the quality of the underlying sentence embeddings.


Explore Similar Projects

text-splitter by benbrandt

semchunk by isaacus-dev

late-chunking by jina-ai

super-rag by superagent-ai

embedJs by llm-tools

chunking_evaluation by brandonstarxel

dsRAG by D-Star-AI

open-parse by Filimoa

chonkie by chonkie-inc

pdfGPT by bhaskatripathi

ragflow by infiniflow

funNLP by fighting41love