HtmlRAG  by plageon

RAG system using HTML for modeling retrieval results

created 11 months ago
433 stars

Top 69.8% on sourcepulse

Project Summary

HtmlRAG improves Retrieval-Augmented Generation (RAG) systems by modeling retrieved knowledge as HTML rather than plain text, so that structural information in web pages is preserved. It targets researchers and developers building advanced RAG applications who need to handle complex web content and produce more accurate, context-aware responses.

How It Works

HtmlRAG introduces two key techniques: Lossless HTML Cleaning to remove irrelevant content while preserving semantic information, and a Two-Step Block-Tree-Based HTML Pruning. The pruning process first uses an embedding model to score HTML blocks and then a generative model to refine the selection, effectively managing the long context inherent in HTML documents. This approach aims to retain crucial information that might be lost in traditional text-based RAG pipelines.
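The two stages described above can be illustrated with a minimal, standard-library-only sketch. It is not the htmlrag API: toy_clean_html, cosine_score, and two_step_prune are hypothetical names, the cleaner also drops void tags (so it is lossy where the project's cleaning is not), and a bag-of-words cosine score stands in for both the embedding model and the generative model. Blocks are passed in as a flat list instead of being built into a block tree.

```python
import math
import re
from collections import Counter
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}  # content assumed irrelevant for QA
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # dropped for simplicity


class _Cleaner(HTMLParser):
    """Keeps tag names and text; drops attributes, comments, scripts, styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if text.strip() and not self.skip_depth:
            self.out.append(escape(text, quote=False))


def toy_clean_html(html):
    """Hypothetical cleaner: returns HTML reduced to bare tags and text."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


def _words(html_block):
    """Lower-cased word tokens of a block, with tags removed."""
    return re.findall(r"\w+", re.sub(r"<[^>]+>", " ", html_block).lower())


def cosine_score(query, block):
    """Stand-in scorer: bag-of-words cosine similarity (not a real embedding)."""
    q, b = Counter(_words(query)), Counter(_words(block))
    dot = sum(q[w] * b[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def two_step_prune(query, blocks, coarse_score, fine_score, coarse_keep, word_budget):
    """Step 1: keep the coarse_keep best blocks under a cheap scorer.
    Step 2: re-rank survivors with a finer scorer and add them greedily
    until word_budget is reached. Output preserves document order."""
    order = sorted(
        range(len(blocks)), key=lambda i: coarse_score(query, blocks[i]), reverse=True
    )
    survivors = order[:coarse_keep]
    survivors.sort(key=lambda i: fine_score(query, blocks[i]), reverse=True)
    kept, used = [], 0
    for i in survivors:
        n = len(_words(blocks[i]))
        if used + n <= word_budget:
            kept.append(i)
            used += n
    return [blocks[i] for i in sorted(kept)]


if __name__ == "__main__":
    page = '<div class="a"><script>x=1</script><p id="p">Hello <b>world</b></p></div>'
    print(toy_clean_html(page))
    blocks = [
        "<p>Paris is the capital of France</p>",
        "<p>Cookies and privacy policy</p>",
        "<p>France borders Spain</p>",
    ]
    print(two_step_prune("capital of France", blocks, cosine_score, cosine_score, 2, 6))
```

In a real pipeline the two scorer arguments would be backed by different models; passing them as callables keeps the cheap first pass and the more expensive second pass interchangeable.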

Quick Start & Requirements

  • Installation: pip install htmlrag or pip install -e . from the toolkit/ directory.
  • Dependencies: Python 3.9+, PyTorch 2.0.1, CUDA 11.7, FAISS-CPU, scikit-learn, transformers, accelerate, bitsandbytes, vllm, and others listed in environment.yml.
  • Setup: Conda environment creation (conda env create -f environment.yml) is recommended.
  • Documentation: available in English.
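After installing, a quick sanity check can confirm the interpreter meets the stated Python 3.9+ requirement and that the htmlrag distribution is visible. The sketch below uses only the standard library; check_htmlrag_environment is a hypothetical helper name, and it deliberately does not import htmlrag or inspect the PyTorch or CUDA versions.

```python
import sys
from importlib import metadata


def check_htmlrag_environment(min_python=(3, 9)):
    """Report whether Python is new enough and which htmlrag version is installed."""
    report = {"python_ok": sys.version_info >= min_python}
    try:
        # Looks up the installed distribution named in `pip install htmlrag`.
        report["htmlrag_version"] = metadata.version("htmlrag")
    except metadata.PackageNotFoundError:
        report["htmlrag_version"] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    print(check_htmlrag_environment())
```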

Highlighted Details

  • Achieves state-of-the-art results on multiple RAG benchmarks (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5).
  • Supports both English and Chinese HTML documents.
  • Provides pre-trained models for HTML pruning (e.g., HTML-Pruner-Phi-3.8B).
  • Includes scripts for data preparation, cleaning, pruning, and evaluation.

Maintenance & Community

  • The project is associated with a WWW 2025 paper.
  • Data and models are available on ModelScope and Hugging Face datasets (HtmlRAG-train, HtmlRAG-test).
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. Check the licenses of the individual models (for example, those hosted on Hugging Face) for compatibility before reuse.

Limitations & Caveats

  • The max_node_words parameter has been removed from GenHTMLPruner as of v0.1.0; users migrating from older versions need to update their model files.
  • The full dataset is not included in the repository because of Git file size limits and must be downloaded separately from Hugging Face.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 39 stars in the last 90 days
