RAG pipeline for web archive exploration
WARC-GPT is an experimental Retrieval Augmented Generation (RAG) pipeline designed for querying large web archive collections. It enables users to ingest WARC files, generate embeddings, and interact with the content via a REST API or a web UI, making historical web data accessible through AI.
How It Works
WARC-GPT processes WARC files, extracting text from HTML and PDF records. It then generates embeddings for this text, splitting it according to the embedding model's context window. These embeddings are stored in a vector database (ChromaDB by default), forming a knowledge base that the RAG pipeline uses to retrieve relevant context for answering user queries via an LLM.
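The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not WARC-GPT's actual code: it uses only the standard library, a toy bag-of-words vector stands in for a real embedding model, an in-memory list stands in for the vector database, and a whitespace token count stands in for the model's context window. All function names here (is_ingestible, split_into_chunks, embed, build_index, retrieve) are hypothetical.

```python
import math
import re
from collections import Counter

# Record types named in this document; anything else is skipped (assumption).
INGESTIBLE_TYPES = {"text/html", "application/pdf"}


def is_ingestible(content_type):
    """Return True if a Content-Type header value names an HTML or PDF record."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in INGESTIBLE_TYPES


def split_into_chunks(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(records, max_tokens):
    """Chunk and embed every ingestible (content_type, text) record."""
    index = []
    for content_type, text in records:
        if not is_ingestible(content_type):
            continue
        for chunk in split_into_chunks(text, max_tokens):
            index.append((chunk, embed(chunk)))
    return index


def retrieve(index, query, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    records = [
        ("text/html; charset=utf-8",
         "The library opened a new reading room in 1998 for archive visitors"),
        ("image/png", "reading room photo"),
        ("application/pdf", "Annual budget report for the city council"),
    ]
    index = build_index(records, max_tokens=6)
    context = retrieve(index, "budget report", k=1)
    # The retrieved chunks would be placed in the LLM prompt as context.
    print("Context:", context)
```

In a real deployment the retrieved chunks are passed to the LLM together with the user's question; the sketch stops at retrieval because the prompt format and model call depend on the configured LLM.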
Quick Start & Requirements
Install: poetry install or pip install . after cloning.
WARC files: ./warc directory.
Run: poetry run flask run.
Ingest: poetry run flask ingest.
Highlighted Details
Extracts text from text/html and application/pdf records.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This project is experimental and is not guaranteed to be production-grade. It prioritizes security and privacy, but initial experiments may focus on closed-loop feedback, and the project's direction may change or the project may be sunset.