warc-gpt by harvard-lil

RAG pipeline for web archive exploration

Created 2 years ago

267 stars

Top 96.1% on SourcePulse

Project Summary

WARC-GPT is an experimental Retrieval Augmented Generation (RAG) pipeline designed for querying large web archive collections. It enables users to ingest WARC files, generate embeddings, and interact with the content via a REST API or a web UI, making historical web data accessible through AI.

How It Works

WARC-GPT processes WARC files, extracting text from HTML and PDF records. It splits that text into chunks sized to the embedding model's context window and generates an embedding for each chunk. The embeddings are stored in a vector database (ChromaDB by default), forming a knowledge base from which the RAG pipeline retrieves relevant context when an LLM answers a user query.
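
The chunk, embed, store, and retrieve steps described above can be illustrated with a minimal, standard-library sketch. This is not WARC-GPT's code: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, a plain Python list stands in for the vector database, and `max_tokens` approximates a context window by counting whitespace-separated words.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Return the top_k stored chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


document = (
    "The library digitized court records in 2015. "
    "Its web archive stores captured pages as WARC files. "
    "Visitors can search the catalog online."
)

# "Ingest": chunk the text and store each chunk next to its vector.
store = [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(document, max_tokens=8)]

# "Query": retrieve the closest chunk and place it in the prompt sent to the LLM.
context = retrieve("Which files hold captured pages?", store, top_k=1)
prompt = "Answer using this context:\n" + "\n".join(context)
```

In a real pipeline the word-count vectors would be replaced by dense embeddings and the list by a vector database query, but the control flow stays the same.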

Quick Start & Requirements

After cloning the repository, install dependencies with Poetry (poetry install) or with pip (pip install .).
Requires Python 3.11+.
WARC files should be placed in a ./warc directory.
Start the server with poetry run flask run.
Ingest WARCs with poetry run flask ingest.
Official documentation: https://github.com/harvard-lil/warc-gpt
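
Before running the commands above, the two stated prerequisites (Python 3.11+ and WARC files under ./warc) can be checked with a short script. This is a hypothetical helper, not part of WARC-GPT; the function names are illustrative, and matching on the .warc and .warc.gz extensions is an assumption about how the files are named.

```python
import sys
from pathlib import Path


def python_ok(version=sys.version_info, minimum=(3, 11)) -> bool:
    """True if the interpreter version meets the stated Python 3.11+ requirement."""
    return tuple(version[:2]) >= minimum


def find_warcs(directory: Path) -> list[Path]:
    """List files that look like WARCs (extension match is an assumption)."""
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
    )


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("WARC files found:", [p.name for p in find_warcs(Path("./warc"))])
```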

Highlighted Details

Supports OpenAI API and Ollama for LLM inference, allowing local or cloud-based models.
Provides a REST API for programmatic interaction and a Web UI for direct use.
Includes functionality to visualize embeddings using t-SNE.
Extracts text from text/html and application/pdf records.
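
As a rough illustration of the record filtering and HTML text extraction mentioned in the last item, the sketch below uses only the Python standard library. It is not the project's extraction code: the helper names are hypothetical, reading records out of a WARC file is left out, and PDF extraction is omitted because it needs a third-party library.

```python
from html.parser import HTMLParser

# Media types the summary says are handled.
SUPPORTED_TYPES = {"text/html", "application/pdf"}


def media_type(content_type: str) -> str:
    """Drop parameters such as '; charset=utf-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_supported(content_type: str) -> bool:
    """True if a record's Content-Type is one of the supported media types."""
    return media_type(content_type) in SUPPORTED_TYPES


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Archive</h1><p>Hello, web.</p></body></html>"
)
text = html_to_text(page)
```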

Maintenance & Community

Developed by Harvard's Library Innovation Lab.
Released under the MIT License.

Licensing & Compatibility

MIT License.
Permissive for commercial use and integration with closed-source applications.

Limitations & Caveats

This project is experimental and is not guaranteed to be production-grade. Although it prioritizes security and privacy, early experiments may focus on closed-loop feedback, and the project may change direction or be sunset.

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

1 star in the last 30 days