HtmlRAG by plageon

RAG system using HTML for modeling retrieval results

Created 1 year ago
445 stars

Top 67.5% on SourcePulse

Project Summary

HtmlRAG enhances Retrieval-Augmented Generation (RAG) systems by representing retrieved knowledge as HTML rather than plain text, which preserves the structure of web content. It targets researchers and developers building RAG applications over web pages and aims for more accurate, context-aware responses.

How It Works

HtmlRAG introduces two key techniques: Lossless HTML Cleaning, which removes irrelevant content while preserving semantic information, and Two-Step Block-Tree-Based HTML Pruning. Pruning first uses an embedding model to score HTML blocks, then uses a generative model to refine the selection, which keeps the long context of HTML documents manageable. The approach aims to retain information that traditional text-based RAG pipelines might lose.
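
The two ideas can be illustrated with a minimal standard-library sketch. It is not the project's implementation: the tag lists, the word-overlap scorers (stand-ins for the embedding and generative models), the word budgets, and all function names are assumptions made for illustration, and both pruning passes operate on the same blocks rather than on a block tree of varying granularity.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed noise tags for this sketch; not the project's actual list.
DROP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag and carry no text, so they are dropped too.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SimpleCleaner(HTMLParser):
    """Keep tag names and text; drop attributes, comments, and DROP_TAGS subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)  # collapse whitespace runs
        if self.skip_depth == 0 and text.strip():
            self.out.append(escape(text, quote=False))


def simple_clean(html):
    """Return the HTML with scripts, styles, comments, and attributes removed."""
    parser = SimpleCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def words(fragment):
    """Lowercased word tokens of an HTML fragment, ignoring tags."""
    return re.findall(r"[a-z0-9]+", re.sub(r"<[^>]+>", " ", fragment).lower())


def prune(blocks, score, budget_words):
    """Greedily keep the highest-scoring blocks that fit the word budget, in document order."""
    ranked = sorted(range(len(blocks)), key=lambda i: score(blocks[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n = len(words(blocks[i]))
        if used + n <= budget_words:
            kept.add(i)
            used += n
    return [blocks[i] for i in sorted(kept)]


def two_step_prune(blocks, query, coarse_budget, fine_budget):
    """Coarse pass followed by a stricter refinement pass over the survivors."""
    q = set(words(query))

    def coarse(block):  # stand-in for embedding similarity
        return len(q & set(words(block)))

    def fine(block):  # stand-in for generative-model scoring
        return len(q & set(words(block))) / max(1, len(words(block)))

    return prune(prune(blocks, coarse, coarse_budget), fine, fine_budget)
```

Calling simple_clean on a page and then two_step_prune on its blocks with a shrinking budget mirrors the coarse-then-fine flow described above. In a real pipeline the two scorers would be replaced by an embedding model and a generative model respectively.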

Quick Start & Requirements

  • Installation: run pip install htmlrag, or run pip install -e . from the toolkit/ directory for an editable install.
  • Dependencies: Python 3.9+, PyTorch 2.0.1, CUDA 11.7, FAISS-CPU, scikit-learn, transformers, accelerate, bitsandbytes, vllm, and others listed in environment.yml.
  • Setup: Conda environment creation (conda env create -f environment.yml) is recommended.
  • Documentation: English Documentation
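
Before running the toolkit, it can help to confirm that the interpreter and the main dependencies are in place. The sketch below uses only the standard library; the distribution names are assumptions derived from the dependency list above, check_environment is a hypothetical helper, and environment.yml in the repository remains the authoritative source.

```python
import sys
from importlib import metadata

# Distribution names assumed from the dependency list; verify against environment.yml.
REQUIRED = [
    "torch",
    "faiss-cpu",
    "scikit-learn",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "vllm",
]


def check_environment(required=REQUIRED, min_python=(3, 9)):
    """Report whether Python is new enough and which packages are installed."""
    report = {"python_ok": sys.version_info >= min_python, "packages": {}}
    for name in required:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = None  # not installed
    return report
```

A value of None under "packages" marks a dependency that is not installed in the active environment. The sketch does not compare installed versions against the pinned ones (for example PyTorch 2.0.1).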

Highlighted Details

  • Achieves state-of-the-art results on multiple RAG benchmarks (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5).
  • Supports both English and Chinese HTML documents.
  • Provides pre-trained models for HTML pruning (e.g., HTML-Pruner-Phi-3.8B).
  • Includes scripts for data preparation, cleaning, pruning, and evaluation.

Maintenance & Community

  • The project is associated with a WWW 2025 paper.
  • Data and models are available on ModelScope and Hugging Face, including the HtmlRAG-train and HtmlRAG-test datasets.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The repository does not explicitly state a license. Check the licenses of the individual models (e.g., those hosted on Hugging Face) for compatibility before use.

Limitations & Caveats

  • The max_node_words parameter has been removed from GenHTMLPruner as of v0.1.0; users migrating from older versions need to update their model files.
  • The full dataset is not included in the repository because of Git file size limits and must be downloaded separately from Hugging Face.
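
When migrating, a small version check can show whether the installed package already falls under the max_node_words change. This is a sketch under stated assumptions: the helper names are hypothetical, the distribution name "htmlrag" is taken from the pip command above, and the parser only compares the leading numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata


def parse_version(version):
    """Parse up to three leading numeric components, e.g. '0.1' -> (0, 1, 0)."""
    nums = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if not match:
            break
        nums.append(int(match.group()))
    return tuple(nums + [0] * (3 - len(nums)))


def max_node_words_removed(version):
    """True if this version is v0.1.0 or later, where the parameter no longer exists."""
    return parse_version(version) >= (0, 1, 0)


def installed_htmlrag_version():
    """Return the installed htmlrag version string, or None if it is not installed."""
    try:
        return metadata.version("htmlrag")
    except metadata.PackageNotFoundError:
        return None
```

If max_node_words_removed returns True for the installed version, code that still passes max_node_words to GenHTMLPruner needs to be adjusted, and the model files should be refreshed as noted above.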

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 11 stars in the last 30 days
