RAG system using HTML for modeling retrieval results
HtmlRAG enhances Retrieval-Augmented Generation (RAG) systems by modeling retrieved knowledge as HTML rather than plain text, so that the structure of web content is kept. It targets researchers and developers building RAG applications over web content who need more accurate, context-aware responses.
How It Works
HtmlRAG introduces two key techniques: lossless HTML cleaning, which removes irrelevant content while preserving semantic information, and two-step, block-tree-based HTML pruning. The pruning process first uses an embedding model to score HTML blocks, then uses a generative model to refine the selection, which keeps the long context inherent in HTML documents manageable. This approach aims to retain crucial information that might be lost in traditional text-based RAG pipelines.
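The two techniques can be illustrated with a minimal, standard-library-only sketch. It is not the htmlrag API: clean_html_sketch, overlap_score, and two_step_prune are hypothetical names, the blocks are a flat list rather than a block tree, and a word-overlap score stands in for both the embedding model and the generative model.

```python
import html
import re
from html.parser import HTMLParser

# Assumptions for this sketch: which tags to drop entirely and which are void.
DROP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _Cleaner(HTMLParser):
    """Re-emit HTML without comments, attributes, or script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            text = " ".join(data.split())  # collapse whitespace runs
            self.out.append(html.escape(text, quote=False))


def clean_html_sketch(html_text):
    """Cleaning step: keep tags and text, drop everything else."""
    parser = _Cleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def _strip_tags(block):
    return re.sub(r"<[^>]+>", " ", block)


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, block):
    """Stand-in for a model score: Jaccard overlap of word sets."""
    q, b = _tokens(query), _tokens(_strip_tags(block))
    union = q | b
    return len(q & b) / len(union) if union else 0.0


def two_step_prune(query, blocks, coarse_keep, fine_scorer, max_words):
    """Step 1 keeps the top coarse_keep blocks; step 2 re-ranks the survivors
    with fine_scorer and greedily fills a word budget. Document order is kept."""
    ranked = sorted(range(len(blocks)),
                    key=lambda i: overlap_score(query, blocks[i]),
                    reverse=True)
    survivors = ranked[:coarse_keep]
    survivors.sort(key=lambda i: fine_scorer(query, blocks[i]), reverse=True)
    chosen, used = [], 0
    for i in survivors:
        n_words = len(_strip_tags(blocks[i]).split())
        if used + n_words <= max_words:
            chosen.append(i)
            used += n_words
    return [blocks[i] for i in sorted(chosen)]


if __name__ == "__main__":
    page = '<div class="a"><!-- nav --><script>x()</script><p>Hello   world</p></div>'
    print(clean_html_sketch(page))
```

In a real pipeline the first scorer would be an embedding similarity and the second a generative model's judgment; passing the second scorer in as a callable keeps that substitution explicit.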
Quick Start & Requirements
Install with pip install htmlrag, or run pip install -e . from the toolkit/ directory.
A conda environment built from environment.yml (conda env create -f environment.yml) is recommended.
Highlighted Details
Pruning model: HTML-Pruner-Phi-3.8B.
Maintenance & Community
Datasets: HtmlRAG-train, HtmlRAG-test.
Licensing & Compatibility
Limitations & Caveats
The max_node_words parameter was removed from GenHTMLPruner in v0.1.0, so users migrating from older versions must update their model files.
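As a rough aid for the migration caveat above, the sketch below checks whether an installed htmlrag release predates v0.1.0. The helper names (parse_version, predates_removal, installed_htmlrag_version) are hypothetical, the parser only reads the first three numeric components (so a pre-release such as 0.1.0rc1 is treated as 0.1.0), and the distribution name htmlrag is taken from the pip command in the quick start.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Release in which max_node_words was removed, per the caveat above.
REMOVAL_VERSION = (0, 1, 0)


def parse_version(text):
    """Turn the leading numeric components of a version string into a 3-tuple."""
    parts = re.findall(r"\d+", text.split("+")[0])[:3]
    return tuple(int(p) for p in parts) + (0,) * (3 - len(parts))


def predates_removal(version_text):
    """True if the given version is older than the removal release."""
    return parse_version(version_text) < REMOVAL_VERSION


def installed_htmlrag_version():
    """Return the installed htmlrag version string, or None if it is absent."""
    try:
        return version("htmlrag")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    installed = installed_htmlrag_version()
    if installed is None:
        print("htmlrag is not installed")
    elif predates_removal(installed):
        print(f"htmlrag {installed} predates v0.1.0")
    else:
        print(f"htmlrag {installed} is v0.1.0 or newer")
```

The check only reports which side of v0.1.0 an installation falls on; which model files to update is left to the project's own instructions.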