epstein-docs
AI-powered archive for searchable public documents
Top 91.2% on SourcePulse
This project addresses the challenge of making large volumes of scanned legal and public documents searchable and accessible. Aimed at researchers, journalists, and power users, it uses AI-powered Optical Character Recognition (OCR) and Natural Language Processing (NLP) to build a searchable archive of publicly released documents related to the Jeffrey Epstein case. The primary benefit is easier access to, and analysis of, a large corpus of sensitive public records.
How It Works
The system employs AI vision models for full OCR, extracting both printed and handwritten text from scanned images. It then utilizes LLMs to identify and index entities (people, organizations, locations, dates), reconstruct multi-page documents, and generate summaries and key topics. A static site generator (11ty) builds a fast, lightweight, and searchable web interface from the processed data. Key advantages include resume-friendly processing, automatic handling of LLM inconsistencies, and deduplication of entities and document types for improved data integrity.
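Two of the ideas above, resume-friendly processing and entity deduplication, can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names, the .jpg input pattern, the one-JSON-file-per-image layout, and the `extract` callback (a stand-in for the OCR/LLM call) are all assumptions made for illustration. The deduplication shown is plain string normalization, which is simpler than whatever the project's deduplicate.py really does.

```python
import json
import re
from pathlib import Path


def pending_images(image_dir: Path, results_dir: Path) -> list[Path]:
    """Return images that have no saved result yet (hypothetical layout)."""
    done = {p.stem for p in results_dir.glob("*.json")}
    return sorted(p for p in image_dir.glob("*.jpg") if p.stem not in done)


def process_all(image_dir: Path, results_dir: Path, extract) -> int:
    """Process only unfinished images and return how many were handled.

    `extract` is a placeholder for the OCR/LLM call; it takes an image
    path and returns a JSON-serializable dict.
    """
    results_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for image in pending_images(image_dir, results_dir):
        result = extract(image)
        # Write under a temporary name, then rename, so an interrupted run
        # never leaves a half-written file that looks like a finished result.
        tmp = results_dir / f"{image.stem}.json.tmp"
        tmp.write_text(json.dumps(result), encoding="utf-8")
        tmp.replace(results_dir / f"{image.stem}.json")
        count += 1
    return count


def normalize_name(name: str) -> str:
    """Collapse case, punctuation, and whitespace so variants compare equal."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def dedupe_entities(names: list[str]) -> dict[str, list[str]]:
    """Group raw names by normalized key; the first spelling seen is canonical."""
    canonical: dict[str, str] = {}
    groups: dict[str, list[str]] = {}
    for name in names:
        canon = canonical.setdefault(normalize_name(name), name)
        groups.setdefault(canon, []).append(name)
    return groups
```

Because finished images are detected from the files already on disk, rerunning process_all after a crash or an API failure continues where the previous run stopped. The normalization step only merges spelling variants that differ in case, punctuation, or spacing; resolving aliases or misspellings would need a fuzzier or LLM-assisted comparison.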
Quick Start & Requirements
Install Python dependencies: pip install -r requirements.txt
Install Node.js dependencies: npm install
Copy .env.example to .env and configure an OpenAI-compatible API endpoint.
Place the scanned images in downloads/ and run python process_images.py. Optional scripts include cleanup_failed.py, deduplicate.py, deduplicate_types.py, and analyze_documents.py.
Run npm run build to generate the static site in _site/, or npm start for a development server.
Highlighted Details
Maintenance & Community
The site is automatically deployed to GitHub Pages upon pushes to the main branch. Contributions are welcomed for improving OCR accuracy, UI, adding document sources, or enhancing entity extraction. No specific community channels (e.g., Discord, Slack) are listed in the README.
Licensing & Compatibility
This project is licensed under the MIT License, so the code is open source and free to use. The documents themselves are public records. The MIT License permits commercial use and inclusion in closed-source projects, provided the copyright and license notice are retained.
Limitations & Caveats
This is an independent archival project, and the maintainers explicitly state that they make no representations regarding the completeness or accuracy of the archive. Planned features such as relationship graphs are not yet implemented, so network analysis of the extracted entities is not available in the current version. The reliance on an external OpenAI-compatible API may introduce usage costs and additional setup requirements.
4 months ago
Inactive