advanced-chunker  by rango-ramesh

Semantic chunker for retrieval-augmented generation (RAG) pipelines

created 3 months ago
262 stars

Top 97.8% on sourcepulse

Project Summary

This package provides semantically aware text chunking and clustering for LLM pipelines and RAG systems. Instead of splitting text at fixed sizes, it groups related ideas using sentence embeddings and clustering, which is intended to produce more coherent and relevant chunks for downstream models. It targets developers and researchers working with LLMs, RAG, and knowledge processing.

How It Works

The package first converts text chunks into embeddings using Sentence Transformers, then computes cosine similarity between those embeddings to identify semantically related chunks. Agglomerative clustering with a distance threshold groups the related chunks, and token-aware merging, using real model tokenizers, keeps each merged chunk within token limits. The result is fewer, denser chunks that preserve contextual coherence, which is intended to improve RAG efficiency and interpretability.
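The embed, compare, and cluster steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: the `embed` function is a toy bag-of-words stand-in for a Sentence Transformers model, the average-linkage rule is an assumption (the source does not state which linkage is used), and the names `embed`, `cosine_similarity`, `cluster`, and the `distance_threshold` value are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: stands in for a real
    Sentence Transformers embedding, which the package actually uses."""
    return Counter(text.lower().replace(".", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, distance_threshold=0.7):
    """Agglomerative clustering on cosine distance (1 - similarity).

    Repeatedly merges the two closest clusters and stops once the
    closest pair is farther apart than distance_threshold.
    ASSUMPTION: average linkage; the source does not name the linkage.
    """
    vectors = [embed(t) for t in texts]
    dist = [[1.0 - cosine_similarity(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > distance_threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


texts = [
    "Cats purr when they are happy.",
    "Happy cats purr and sleep.",
    "Interest rates rose this quarter.",
    "The bank raised interest rates.",
]
groups = cluster(texts)
print(groups)  # lists of indices into `texts`, one list per cluster
```

With these four sentences, the two about cats land in one cluster and the two about interest rates in another, because sentences from different topics share no words and so sit at cosine distance 1.0, above the threshold. A lower threshold yields more, smaller clusters.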

Quick Start & Requirements

  • Install via pip: pip install advanced-chunker
  • Requires Python.
  • Official documentation and examples are available in the README.

Highlighted Details

  • Integrates with LangChain via SemanticChunkerSplitter.
  • Offers CLI for scripting and automation with export options to JSON, Markdown, and CSV.
  • Provides visualization tools for attention heatmaps, semantic graphs, and cluster previews.
  • Supports token-aware merging with real model tokenizers.
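The token-aware merging listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's API: `merge_within_limit` and `count_tokens` are hypothetical names, the greedy left-to-right strategy is assumed, and the whitespace counter is a stand-in for a real model tokenizer (a tokenizer-backed counter could be passed through the `count` parameter instead).

```python
def count_tokens(text):
    """Whitespace token count. ASSUMPTION: a stand-in for a real model
    tokenizer, which would usually report a different count."""
    return len(text.split())


def merge_within_limit(chunks, max_tokens, count=count_tokens):
    """Greedily join consecutive chunks while the joined text stays
    within max_tokens, as measured by the `count` callable.

    A single chunk that already exceeds max_tokens is kept as-is;
    this sketch never splits chunks.
    """
    merged = []
    current = ""
    for chunk in chunks:
        candidate = f"{current} {chunk}".strip()
        if current and count(candidate) > max_tokens:
            # Adding this chunk would overflow: close the current group.
            merged.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


result = merge_within_limit(["a b c", "d e", "f g h i", "j"], max_tokens=5)
print(result)
```

Taking the counter as a parameter keeps the merging logic independent of any particular tokenizer, so the same loop works whichever model's token limit is being respected.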

Maintenance & Community

  • Open to pull requests; issues should be opened for feature requests or bug fixes.
  • Mentions integrations with LangChain, Sentence Transformers, scikit-learn, and Hugging Face.

Licensing & Compatibility

  • MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The package focuses on semantic chunking and clustering; advanced text preprocessing and other NLP tasks are out of scope. Chunk quality depends on the quality of the underlying sentence embeddings.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 24 stars in the last 90 days
