warc-gpt by harvard-lil

RAG pipeline for web archive exploration

created 1 year ago
257 stars

Top 98.8% on sourcepulse

Project Summary

WARC-GPT is an experimental Retrieval-Augmented Generation (RAG) pipeline for querying large web archive collections. Users ingest WARC files, generate embeddings from them, and then query the content through a REST API or a web UI, so archived web data can be explored with natural-language questions.

How It Works

WARC-GPT processes WARC files, extracting text from HTML and PDF records. It then generates embeddings for this text, splitting it according to the embedding model's context window. These embeddings are stored in a vector database (ChromaDB by default), forming a knowledge base that the RAG pipeline uses to retrieve relevant context for answering user queries via an LLM.
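
The chunk-then-retrieve flow described above can be illustrated with a minimal sketch. It is not WARC-GPT's actual code: the function names (split_into_chunks, toy_embed, retrieve) are hypothetical, whitespace-separated words stand in for model tokens, and a bag-of-words count vector stands in for a real embedding model and for the vector database.

```python
import math
from collections import Counter


def split_into_chunks(text: str, max_tokens: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of at most max_tokens whitespace tokens.

    Assumption: words approximate the embedding model's tokens.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]
```

In a real pipeline the retrieved chunks would be passed to the LLM as context alongside the user's question; here retrieve simply returns them.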

Quick Start & Requirements

  • Clone the repository, then install dependencies with Poetry (poetry install) or pip (pip install .).
  • Requires Python 3.11+.
  • WARC files should be placed in a ./warc directory.
  • Start the server with poetry run flask run.
  • Ingest WARCs with poetry run flask ingest.
  • Official documentation: https://github.com/harvard-lil/warc-gpt

Highlighted Details

  • Supports OpenAI API and Ollama for LLM inference, allowing local or cloud-based models.
  • Provides a REST API for programmatic interaction and a Web UI for direct use.
  • Can visualize embeddings using t-SNE.
  • Extracts text from text/html and application/pdf records.
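
As a rough illustration of the record filtering and HTML text extraction mentioned above, the sketch below uses only the Python standard library. It is an assumption-laden example, not the project's implementation: the names is_supported and html_to_text are hypothetical, reading the WARC container itself is left out, and PDF extraction is omitted because it would need a third-party library.

```python
from html.parser import HTMLParser

# MIME types named in the list above.
SUPPORTED_TYPES = {"text/html", "application/pdf"}


def is_supported(content_type: str) -> bool:
    """Check a Content-Type header value, ignoring parameters such as charset."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in SUPPORTED_TYPES


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A caller would check is_supported on each record's Content-Type header and pass the decoded body of text/html records to html_to_text before chunking.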

Maintenance & Community

  • Developed by Harvard's Library Innovation Lab.
  • Released under the MIT License.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

This project is experimental and is not guaranteed to be production-grade. Although it is built with security and privacy in mind, its early experiments may focus on closed-loop feedback, and the project may change direction or be sunset.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 9 stars in the last 90 days
