advanced-chunker  by rango-ramesh

Semantic chunker for retrieval-augmented generation (RAG) pipelines

Created 5 months ago
270 stars

Top 95.2% on SourcePulse

Project Summary

This package provides semantically aware text chunking and clustering for LLM pipelines and RAG systems. Instead of splitting text at fixed sizes, it groups related ideas using sentence embeddings and clustering, which yields more coherent and relevant chunks for downstream models. It is aimed at developers and researchers working with LLMs, RAG, and knowledge processing.

How It Works

The package first converts text chunks into embeddings using Sentence Transformers, then computes cosine similarity between the embeddings to identify semantically related chunks. It applies agglomerative clustering with a distance threshold and merges the clustered chunks in a token-aware way, using real model tokenizers to respect token limits. The result is fewer, denser chunks that preserve contextual coherence, which is intended to improve RAG efficiency and interpretability.
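The similarity and clustering steps can be illustrated with a small, dependency-free sketch. It is not the package's implementation: the bag-of-words `embed` function is a toy stand-in for a Sentence Transformers embedding, the average-linkage loop is a hand-written substitute for a library clustering routine, and every name and the 0.8 threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real sentence embedding)."""
    return Counter(text.lower().replace(".", "").split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, distance_threshold):
    """Average-linkage agglomerative clustering that stops at distance_threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > distance_threshold:
            break  # Remaining clusters are too far apart to merge.
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


chunks = [
    "Cats purr when they are happy.",
    "Happy cats purr and knead blankets.",
    "Interest rates affect mortgage payments.",
    "Mortgage payments rise when interest rates rise.",
]
vectors = [embed(c) for c in chunks]
clusters = agglomerative_clusters(vectors, distance_threshold=0.8)
for members in clusters:
    print([chunks[i] for i in members])
```

With these four sentences, the two cat sentences end up in one cluster and the two mortgage sentences in another. A lower threshold produces more, smaller clusters; a higher one merges more aggressively.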

Quick Start & Requirements

  • Install via pip: pip install advanced-chunker
  • Requires Python.
  • Official documentation and examples are available in the README.

Highlighted Details

  • Integrates with LangChain via SemanticChunkerSplitter.
  • Offers a CLI for scripting and automation, with export to JSON, Markdown, and CSV.
  • Provides visualization tools for attention heatmaps, semantic graphs, and cluster previews.
  • Supports token-aware merging with real model tokenizers.
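Token-aware merging, as listed above, can be sketched as a greedy pass that joins consecutive chunks until a token budget would be exceeded. This is an illustration under stated assumptions, not the package's code: `count_tokens` uses whitespace splitting as a stand-in for a real model tokenizer, and `merge_within_limit` and `max_tokens` are hypothetical names.

```python
def count_tokens(text):
    """Whitespace token count (assumption: stands in for a real model tokenizer)."""
    return len(text.split())


def merge_within_limit(chunks, max_tokens):
    """Greedily join consecutive chunks while the merged text fits in max_tokens."""
    merged = []
    current = ""
    for chunk in chunks:
        candidate = f"{current} {chunk}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this chunk would exceed the budget: close the current group.
            merged.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


cluster = [
    "Cats purr when they are happy.",
    "Happy cats purr and knead blankets.",
    "Kittens learn to purr within days.",
]
merged = merge_within_limit(cluster, max_tokens=12)
for text in merged:
    print(count_tokens(text), text)
```

In this sketch a single chunk that already exceeds `max_tokens` is kept as-is rather than split. Swapping `count_tokens` for a call into an actual tokenizer would make the budget match a specific model's limits.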

Maintenance & Community

  • Open to pull requests; open an issue for feature requests or bug fixes.
  • Mentions integrations with LangChain, Sentence Transformers, scikit-learn, and Hugging Face.

Licensing & Compatibility

  • MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The package focuses on semantic chunking and clustering; advanced text preprocessing and other NLP tasks are out of scope. Chunk quality depends on the quality of the underlying sentence embeddings.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
