plageon
RAG system using HTML for modeling retrieval results
HtmlRAG enhances Retrieval-Augmented Generation (RAG) systems by leveraging HTML structure over plain text for improved knowledge modeling. It targets researchers and developers building advanced RAG applications, offering a novel approach to handle the complexity of web content for more accurate and context-aware responses.
How It Works
HtmlRAG introduces two key techniques: Lossless HTML Cleaning, which removes irrelevant content while preserving semantic information, and Two-Step Block-Tree-Based HTML Pruning. The pruning step first uses an embedding model to score HTML blocks, then a generative model to refine the selection, which keeps the long context inherent in HTML documents manageable. The approach aims to retain information that is often lost in traditional plain-text RAG pipelines.
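The two stages can be illustrated with a minimal, standard-library-only sketch. It is not the htmlrag API: `BlockExtractor`, `coarse_prune`, and `refine` are hypothetical names, a bag-of-words cosine score stands in for the embedding model, and a word-budget filter stands in for the generative model. Treat it as an outline of the idea under those assumptions.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}            # assumed list of non-semantic tags
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed

class BlockExtractor(HTMLParser):
    """Cleaning stand-in: collect (tag_path, text) blocks, dropping
    script/style content and comments (comments are ignored by default)."""
    def __init__(self):
        super().__init__()
        self.path, self.skip_depth, self.blocks = [], 0, []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in DROP_TAGS:
            self.skip_depth += 1
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        if tag in self.path:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.blocks.append(("/".join(self.path), text))

def bow(text):
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def coarse_prune(question, blocks, keep):
    """Step 1: score every block against the question, keep the top `keep`
    blocks, and return them in document order."""
    q = bow(question)
    ranked = sorted(range(len(blocks)),
                    key=lambda i: (-cosine(q, bow(blocks[i][1])), i))
    return [blocks[i] for i in sorted(ranked[:keep])]

def refine(question, blocks, max_words):
    """Step 2 placeholder: a real system would call a generative model here;
    this version greedily keeps the best blocks that fit a word budget."""
    q = bow(question)
    ranked = sorted(range(len(blocks)),
                    key=lambda i: (-cosine(q, bow(blocks[i][1])), i))
    chosen, used = [], 0
    for i in ranked:
        n = len(blocks[i][1].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    return [blocks[i] for i in sorted(chosen)]

html = """
<html><head><style>p {color: red}</style><script>track()</script></head>
<body>
<!-- nav -->
<div><p>Subscribe to our newsletter today</p></div>
<div><h1>Bee Gees</h1><p>The Bee Gees were formed in 1958 by the Gibb brothers</p></div>
<div><p>Copyright notice and terms of use</p></div>
</body></html>
"""
question = "When were the Bee Gees formed?"

extractor = BlockExtractor()
extractor.feed(html)
blocks = extractor.blocks
coarse = coarse_prune(question, blocks, keep=2)
final = refine(question, coarse, max_words=12)
```

The tag path is kept alongside each block so the surviving text can still be tied back to its place in the HTML tree, which is the structural information a plain-text pipeline would discard.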
Quick Start & Requirements
Install with pip install htmlrag, or run pip install -e . from the toolkit/ directory.
Dependencies: environment.yml.
A conda environment (conda env create -f environment.yml) is recommended.
Highlighted Details
HTML-Pruner-Phi-3.8B).
Maintenance & Community
HtmlRAG-train, HtmlRAG-test).
Licensing & Compatibility
Limitations & Caveats
The max_node_words parameter was removed from GenHTMLPruner as of v0.1.0, so users migrating from older versions need to update their model files.