epstein-docs.github.io  by epstein-docs

AI-powered archive for searchable public documents

Created 4 months ago
290 stars

Top 91.2% on SourcePulse

Project Summary

This project addresses the challenge of making large volumes of scanned legal and public documents searchable and accessible. It targets researchers, journalists, and power users by leveraging AI-powered Optical Character Recognition (OCR) and Natural Language Processing (NLP) to create a dynamic, searchable archive of publicly released documents related to the Jeffrey Epstein case. The primary benefit is enhanced accessibility and analytical capability for a significant corpus of sensitive public records.

How It Works

The system employs AI vision models for full OCR, extracting both printed and handwritten text from scanned images. It then utilizes LLMs to identify and index entities (people, organizations, locations, dates), reconstruct multi-page documents, and generate summaries and key topics. A static site generator (11ty) builds a fast, lightweight, and searchable web interface from the processed data. Key advantages include resume-friendly processing, automatic handling of LLM inconsistencies, and deduplication of entities and document types for improved data integrity.
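
The resume-friendly behavior described above can be illustrated with a minimal sketch. It assumes one JSON result file per image and skips any image that already has one; the function names, the results directory, and the `.jpg` pattern are hypothetical and are not taken from the project's code.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def process_pending(
    image_dir: Path,
    result_dir: Path,
    extract: Callable[[Path], dict],
) -> list[str]:
    """Process only images with no saved result yet; return the names processed."""
    result_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for image in sorted(image_dir.glob("*.jpg")):
        out_path = result_dir / f"{image.stem}.json"
        if out_path.exists():
            continue  # finished in an earlier run, so skip it
        record = extract(image)  # stand-in for the vision-model call
        # Write to a temporary file first so an interrupted run
        # never leaves a partial result that would be skipped later.
        tmp_path = out_path.with_suffix(".json.tmp")
        tmp_path.write_text(json.dumps(record), encoding="utf-8")
        tmp_path.replace(out_path)
        processed.append(image.name)
    return processed


def fake_extract(image: Path) -> dict:
    """Hypothetical placeholder for an OCR request."""
    return {"source": image.name, "text": ""}


def demo() -> tuple[list[str], list[str]]:
    """Run twice over a temporary folder to show that finished work is skipped."""
    with tempfile.TemporaryDirectory() as tmp:
        images = Path(tmp) / "downloads"
        results = Path(tmp) / "results"
        images.mkdir()
        for name in ("page_001.jpg", "page_002.jpg"):
            (images / name).write_bytes(b"")
        first = process_pending(images, results, fake_extract)
        (images / "page_003.jpg").write_bytes(b"")
        second = process_pending(images, results, fake_extract)
        return first, second
```

Calling `demo()` processes two images on the first pass and only the newly added image on the second, which is the property that makes a long, API-backed run safe to interrupt and restart.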

Quick Start & Requirements

  • Installation:
    • Python dependencies: pip install -r requirements.txt
    • Node.js dependencies: npm install
  • Configuration: Copy .env.example to .env and configure an OpenAI-compatible API endpoint.
  • Processing: Place document images in downloads/ and run python process_images.py. Optional scripts include cleanup_failed.py, deduplicate.py, deduplicate_types.py, and analyze_documents.py.
  • Website Generation: Run npm run build to generate the static site in _site/, or npm start to launch a development server.
  • Prerequisites: Python, Node.js, and access to an OpenAI-compatible API.
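
Before starting a long processing run, it can help to confirm that the .env file is complete. The sketch below parses simple KEY=VALUE lines and reports missing settings. The key names `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are assumptions made for illustration; the real names are whatever .env.example defines.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings: dict[str, str], required: tuple[str, ...]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]


# Hypothetical key names; check the project's .env.example for the real ones.
REQUIRED = ("API_BASE_URL", "API_KEY", "MODEL_NAME")

# Placeholder values only, used to exercise the parser.
EXAMPLE_ENV = """
# endpoint of an OpenAI-compatible service
API_BASE_URL=https://example.invalid/v1
API_KEY="replace-me"
"""
```

With the placeholder text above, `missing_keys(parse_env(EXAMPLE_ENV), REQUIRED)` reports that `MODEL_NAME` is not set.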

Highlighted Details

  • Full OCR capabilities, including handwritten text extraction.
  • Automatic identification and indexing of named entities.
  • AI-driven document analysis yielding summaries, key topics, and significance.
  • Reconstruction of multi-page documents from individual scans.
  • A static, searchable web interface for browsing entities and documents.
  • Automated deduplication for entities and document types to ensure consistency.
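
As a rough illustration of entity deduplication, the sketch below groups spelling variants under a normalized key and maps each variant to the most frequent spelling in its group. This is a simple string-normalization approach chosen for the example; it is not a description of how the project's deduplicate.py works, and the function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def normalize(name: str) -> str:
    """Build a comparison key: strip accents, punctuation, case, and extra spaces."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_name.lower())
    return " ".join(cleaned.split())


def build_canonical_map(mentions: list[str]) -> dict[str, str]:
    """Map every raw mention to the most frequent spelling that shares its key."""
    groups: dict[str, Counter] = defaultdict(Counter)
    for mention in mentions:
        groups[normalize(mention)][mention] += 1
    mapping = {}
    for counter in groups.values():
        canonical = counter.most_common(1)[0][0]
        for variant in counter:
            mapping[variant] = canonical
    return mapping
```

Key-based grouping only merges variants that differ in case, punctuation, or spacing. It will not merge abbreviations or misspellings, which is one reason a pipeline like this might hand the harder cases to an LLM.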

Maintenance & Community

The site is automatically deployed to GitHub Pages upon pushes to the main branch. Contributions are welcomed for improving OCR accuracy, UI, adding document sources, or enhancing entity extraction. No specific community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

This project is licensed under the MIT License, making the code open-source and free to use. The documents themselves are public records. The MIT License permits commercial use and linking with closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

This is an independent archival project, and the maintainers explicitly state they make no representations regarding the completeness or accuracy of the archive. Planned features such as relationship graphs are not yet implemented, so the archive does not currently support network analysis of entities. The reliance on an external OpenAI-compatible API may introduce costs or specific setup requirements.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2

Star History

126 stars in the last 30 days
