epstein-docs.github.io by epstein-docs

AI-powered archive for searchable public documents

Created 4 months ago

290 stars

Top 91.2% on SourcePulse

Project Summary

This project makes large volumes of scanned legal and public documents searchable and accessible. It is aimed at researchers, journalists, and power users, and uses AI-powered Optical Character Recognition (OCR) and Natural Language Processing (NLP) to build a searchable archive of publicly released documents related to the Jeffrey Epstein case. The primary benefit is easier access to, and analysis of, a large corpus of sensitive public records.

How It Works

The system uses AI vision models for full OCR, extracting both printed and handwritten text from scanned images. It then uses LLMs to identify and index entities (people, organizations, locations, dates), reconstruct multi-page documents, and generate summaries and key topics. A static site generator (11ty) builds a fast, lightweight, and searchable web interface from the processed data. Key advantages include resume-friendly processing, automatic handling of LLM inconsistencies, and deduplication of entities and document types for improved data integrity.
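As a rough illustration of what "resume-friendly processing" can mean, the sketch below skips any image that already has a result file, so an interrupted run can be restarted without redoing finished work. This is not taken from process_images.py: the function names, the one-JSON-file-per-image layout, and the `extract` callback (standing in for the OCR/LLM call) are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable


def pending_images(image_dir: Path, results_dir: Path,
                   suffixes: Iterable[str] = (".jpg", ".jpeg", ".png")) -> list[Path]:
    """Return images that do not yet have a matching JSON result file."""
    allowed = {s.lower() for s in suffixes}
    todo = []
    for image in sorted(image_dir.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in allowed:
            continue
        # Hypothetical layout: results mirror the image tree, with a .json suffix.
        result = (results_dir / image.relative_to(image_dir)).with_suffix(".json")
        if not result.exists():
            todo.append(image)
    return todo


def process_all(image_dir: Path, results_dir: Path,
                extract: Callable[[Path], str]) -> int:
    """Process only unfinished images and return how many were handled."""
    count = 0
    for image in pending_images(image_dir, results_dir):
        result = (results_dir / image.relative_to(image_dir)).with_suffix(".json")
        result.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first, then rename, so a crash mid-write
        # never leaves a partial file that would be mistaken for finished work.
        tmp = result.with_name(result.name + ".tmp")
        tmp.write_text(extract(image), encoding="utf-8")
        tmp.replace(result)
        count += 1
    return count
```

Calling `process_all(Path("downloads"), Path("results"), extract)` a second time would only touch images added since the previous run; `results` is a placeholder directory name.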

Quick Start & Requirements

- Prerequisites: Python, Node.js, and access to an OpenAI-compatible API.
- Python dependencies: pip install -r requirements.txt
- Node.js dependencies: npm install
- Configuration: Copy .env.example to .env and configure an OpenAI-compatible API endpoint.
- Processing: Place document images in downloads/ and run python process_images.py. Optional scripts include cleanup_failed.py, deduplicate.py, deduplicate_types.py, and analyze_documents.py.
- Website generation: Run npm run build to generate the static site in _site/, or npm start for a development server.
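Before starting a long processing run, it can help to confirm that the .env file from the configuration step is actually filled in. The sketch below is one way to do that with only the standard library. The key names in `REQUIRED_KEYS` are hypothetical placeholders, not the project's real settings; the actual names are whatever .env.example defines.

```python
from pathlib import Path

# Hypothetical key names for illustration only; check .env.example for the real ones.
REQUIRED_KEYS = ("API_BASE_URL", "API_KEY", "MODEL_NAME")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    if not env_file.exists():
        print("No .env found; copy .env.example to .env first.")
    else:
        missing = missing_keys(parse_env(env_file.read_text(encoding="utf-8")))
        print("Missing settings:", ", ".join(missing) if missing else "none")
```

This parser handles only plain KEY=VALUE lines; it does not cover multi-line values or variable interpolation.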

Highlighted Details

- Full OCR capabilities, including handwritten text extraction.
- Automatic identification and indexing of named entities.
- AI-driven document analysis yielding summaries, key topics, and significance.
- Reconstruction of multi-page documents from individual scans.
- A static, searchable web interface for browsing entities and documents.
- Automated deduplication for entities and document types to ensure consistency.
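To give a feel for what entity deduplication involves, here is a minimal sketch that groups spelling variants under one canonical name using plain string normalization. This is a deliberately simple stand-in technique, not the approach used by the project's deduplicate.py, which this summary does not describe. The function names and sample entities are invented for the example.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Build a comparison key: strip accents, case, punctuation, and extra spaces."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def build_alias_map(names: list[str]) -> dict[str, str]:
    """Map every variant to one canonical spelling.

    The canonical form is the most frequent variant in its group;
    on a tie, the variant seen first wins.
    """
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    alias_map = {}
    for variants in groups.values():
        canonical = max(variants, key=variants.count)  # max keeps the first on ties
        for variant in variants:
            alias_map[variant] = canonical
    return alias_map


if __name__ == "__main__":
    extracted = ["J. Smith", "j smith", "J. Smith", "Acme Corp.", "ACME CORP"]
    for variant, canonical in build_alias_map(extracted).items():
        print(f"{variant!r} -> {canonical!r}")
```

Normalization alone cannot tell that "J. Smith" and "John Smith" refer to the same person, which is the kind of case where a model-assisted or manually reviewed merge step becomes necessary.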

Maintenance & Community

The site is automatically deployed to GitHub Pages on every push to the main branch. Contributions are welcome in four areas: improving OCR accuracy, refining the UI, adding document sources, and enhancing entity extraction. No specific community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, which permits use, modification, and redistribution of the code, including in commercial and closed-source projects, provided the copyright and license notice are retained. The documents themselves are public records.

Limitations & Caveats

This is an independent archival project, and the maintainers explicitly state that they make no representations regarding the completeness or accuracy of the archive. Planned features such as relationship graphs are not yet implemented, so network-style analysis of connections between entities is not currently supported. The reliance on an external OpenAI-compatible API may introduce usage costs and additional setup requirements.
