harvard-lil
RAG pipeline for web archive exploration
Top 97.2% on SourcePulse
WARC-GPT is an experimental Retrieval Augmented Generation (RAG) pipeline designed for querying large web archive collections. It enables users to ingest WARC files, generate embeddings, and interact with the content via a REST API or a web UI, making historical web data accessible through AI.
How It Works
WARC-GPT processes WARC files, extracting text from HTML and PDF records. It then generates embeddings for this text, splitting it according to the embedding model's context window. These embeddings are stored in a vector database (ChromaDB by default), forming a knowledge base that the RAG pipeline uses to retrieve relevant context for answering user queries via an LLM.
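The flow described above (split text to fit a context window, embed each chunk, retrieve the closest chunks for a query) can be sketched in a few lines. This is a minimal illustration, not WARC-GPT's actual code: the function names are hypothetical, word counts stand in for model tokens, a bag-of-words counter stands in for a real embedding model, and an in-memory list stands in for the vector database.

```python
import math
from collections import Counter


def split_into_chunks(text, max_tokens=8, overlap=2):
    """Split text into overlapping word windows.

    Assumption: words approximate tokens; a real pipeline would size
    chunks using the embedding model's own tokenizer and context window.
    """
    if max_tokens <= overlap:
        raise ValueError("max_tokens must be greater than overlap")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector (hypothetical stand-in for an embedding model)."""
    return Counter(word.strip(".,").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(documents, max_tokens=8, overlap=2):
    """Chunk and embed every document; returns a list of (chunk, vector) pairs."""
    store = []
    for doc in documents:
        for chunk in split_into_chunks(doc, max_tokens, overlap):
            store.append((chunk, embed(chunk)))
    return store


def retrieve(query, store, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


if __name__ == "__main__":
    docs = [
        "The library archived the court website in 2015.",
        "Cats sleep for most of the day.",
    ]
    store = build_store(docs, max_tokens=20, overlap=2)
    print(retrieve("court website archive", store, top_k=1))
```

In a real deployment the retrieved chunks would be placed into the LLM prompt as context; the overlap between windows is one common way to avoid cutting a relevant sentence in half at a chunk boundary.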
Quick Start & Requirements
poetry install or pip install . after cloning.
./warc directory.
poetry run flask run.
poetry run flask ingest.
Highlighted Details
text/html and application/pdf records.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
This project is experimental and not guaranteed to be production-grade. While it prioritizes security and privacy, initial experiments may focus on closed-loop feedback, and future development paths are subject to change or sunsetting.
8 months ago
Inactive