HtmlRAG by plageon

RAG system using HTML for modeling retrieval results

Created 1 year ago

455 stars

Top 66.3% on SourcePulse

Project Summary

HtmlRAG is a Retrieval-Augmented Generation (RAG) system that represents retrieved web content as HTML rather than plain text, so the structural information in the original pages stays available to the model. It targets researchers and developers building RAG applications over web content who want more accurate, context-aware responses.

How It Works

HtmlRAG introduces two key techniques. Lossless HTML Cleaning removes irrelevant content while preserving semantic information. Two-Step Block-Tree-Based HTML Pruning then shortens the document: an embedding model first scores HTML blocks, and a generative model refines the selection. Together these steps manage the long context inherent in HTML documents while retaining information that traditional text-based RAG pipelines might lose.
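The two ideas above can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: the names clean_html_sketch and prune_blocks_sketch are hypothetical, the cleaning rules (drop scripts, styles, comments, attributes, and empty tags) are an assumed simplification, and a bag-of-words cosine score stands in for the embedding model. The generative refinement step is omitted.

```python
import math
import re
from collections import Counter
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}  # content the reader never sees
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _Cleaner(HTMLParser):
    """Keeps tag names and visible text; drops attributes, comments, scripts, styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Void tags carry no text; emit a space so neighbouring words stay apart.
            self.out.append(" " if tag in VOID_TAGS else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)
        if self.skip_depth == 0 and text.strip():
            self.out.append(escape(text, quote=False))


def clean_html_sketch(raw):
    """Hypothetical cleaner: same visible text and nesting, far fewer tokens."""
    parser = _Cleaner()
    parser.feed(raw)
    parser.close()
    cleaned = "".join(parser.out)
    # Repeatedly remove tags that ended up empty, e.g. <div></div>.
    while True:
        reduced = re.sub(r"<(\w+)></\1>", "", cleaned)
        if reduced == cleaned:
            return cleaned
        cleaned = reduced


def _text_of(block):
    return re.sub(r"<[^>]+>", " ", block)


def _bow(text):
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def prune_blocks_sketch(blocks, query, max_words):
    """Greedily keep the best-scoring blocks that fit in max_words, in document order.

    The bag-of-words cosine is a stand-in for an embedding model's similarity score.
    """
    q = _bow(query)
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: _cosine(_bow(_text_of(blocks[i])), q),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        n_words = len(_text_of(blocks[i]).split())
        if used + n_words <= max_words:
            kept.add(i)
            used += n_words
    return [blocks[i] for i in sorted(kept)]
```

In this sketch the blocks are passed in as a flat list of HTML fragments. A block tree, as the technique's name suggests, would instead group nested elements so that pruning can drop or keep whole subtrees.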

Quick Start & Requirements

Installation: pip install htmlrag, or pip install -e . from the toolkit/ directory to install from source.
Dependencies: Python 3.9+, PyTorch 2.0.1, CUDA 11.7, FAISS-CPU, scikit-learn, transformers, accelerate, bitsandbytes, vllm, and others listed in environment.yml.
Setup: Conda environment creation (conda env create -f environment.yml) is recommended.
Documentation: English Documentation
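Before running the toolkit, it can help to confirm that the interpreter and the listed dependencies are in place. The following is a minimal sketch, not part of the project: check_environment is a hypothetical helper, and the import names (for example faiss for FAISS-CPU and sklearn for scikit-learn) are the usual module names for those packages rather than something stated in the listing. It does not check the pinned PyTorch or CUDA versions.

```python
import sys
from importlib.util import find_spec

# Usual import names for the dependencies listed above (assumed, not from the README).
REQUIRED_MODULES = [
    "torch",
    "faiss",
    "sklearn",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "vllm",
]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 9)):
    """Report whether Python is new enough and which modules can be located."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```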

Highlighted Details

Reports state-of-the-art results on six QA benchmarks used to evaluate RAG (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5).
Supports both English and Chinese HTML documents.
Provides pre-trained models for HTML pruning (e.g., HTML-Pruner-Phi-3.8B).
Includes scripts for data preparation, cleaning, pruning, and evaluation.

Maintenance & Community

The project is associated with a WWW 2025 paper.
Models and data are hosted on ModelScope and Hugging Face; the datasets are HtmlRAG-train and HtmlRAG-test.
No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Check the licenses of the individual models (for example, those hosted on Hugging Face) for compatibility before use.

Limitations & Caveats

The max_node_words parameter was removed from GenHTMLPruner in v0.1.0; users migrating from older versions need to update their model files.
The full dataset is not included in the repository because of Git file size limits and must be downloaded separately from Hugging Face.

Health Check

Last Commit: 7 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

6 stars in the last 30 days

Explore Similar Projects

Awesome-RAG by frutik

RAG resource list

Created 2 years ago

Updated 4 months ago

awesome-rag by coree

Curated list of resources for retrieval-augmented generation (RAG) in LLMs

Created 1 year ago

Updated 1 month ago

RAG-Book by Nipi64310

RAG system evolution and implementation

Created 1 year ago

Updated 1 year ago

llm-mcp-rag by KelvinQiu802

Augmented LLM for RAG and MCP agents

Created 9 months ago

Updated 9 months ago

MasteringRAG by Steven-Luo

LLM-based RAG system for enterprise document Q&A

Created 1 year ago

Updated 6 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 1 more.

Lumos by andrewnguonly

Chrome extension for local LLM web RAG co-piloting

Created 2 years ago

Updated 11 months ago

Starred by Travis Fischer (Founder of Agentic) and Elvis Saravia (Founder of DAIR.AI).

RAG-Survey by hymie122

RAG paper collection for AI-Generated Content

Created 1 year ago

Updated 1 year ago

all-rag-techniques by FareedKhan-dev

Jupyter notebooks for RAG technique implementations

Created 10 months ago

Updated 6 months ago

rag-zero-to-hero-guide by KalyanKS-NLP

RAG learning guide, from basics to advanced

Created 10 months ago

Updated 9 months ago

Starred by Elvis Saravia (Founder of DAIR.AI), Thomas Wolf (Cofounder of Hugging Face), and 2 more.

lazynlp by chiphuyen

Web scraping library for creating massive datasets

Created 6 years ago

Updated 5 years ago

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Yiran Wu (Coauthor of AutoGen), and 2 more.

RAG_Techniques by NirDiamant

RAG techniques showcase for enhanced generation systems

Created 1 year ago

Updated 1 month ago

Starred by Tobi Lutke (Cofounder of Shopify), Rodrigo Nader (Cofounder of Langflow), and 9 more.

ragflow by infiniflow

Open-source RAG engine for deep document understanding

Created 2 years ago

Updated 1 day ago
