OCR pipeline for ML training datasets from documents
This project provides a multi-modal OCR pipeline specifically designed for extracting structured data from complex educational documents, such as exam papers, for machine learning training. It targets students, researchers, and developers needing to create high-quality datasets from multilingual text, mathematical formulas, tables, diagrams, and charts, offering semantically annotated outputs for enhanced model training.
How It Works
The system employs a two-stage workflow. Stage 1 uses ocr_stage1.py for initial OCR extraction, leveraging DocLayout-YOLO for layout detection and tools like Google Vision API and MathPix OCR for content recognition. Stage 2 (ocr_stage2.py) processes these intermediate results to generate structured, AI-ready outputs (JSON/Markdown), including natural language descriptions for visual elements and summaries for tables. This approach aims for high accuracy and preserves contextual continuity through coordinate and metadata retention.
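To make the idea of coordinate and metadata retention concrete, here is a minimal sketch of how a Stage 1 intermediate record might be carried into a Stage 2 structured output. It is not taken from the project: the `Region` class, its field names, and the `reading_order` and `to_structured` helpers are all hypothetical, and the real scripts may use a different schema entirely.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Region:
    """Hypothetical Stage 1 record for one detected layout region."""
    kind: str      # assumed labels, e.g. "text", "formula", "table", "figure"
    bbox: list     # [x0, y0, x1, y1] in page coordinates
    page: int
    content: str   # raw recognized content (text, LaTeX, table markup, ...)
    meta: dict = field(default_factory=dict)


def reading_order(regions):
    """Sort by page, then top-to-bottom, then left-to-right."""
    return sorted(regions, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))


def to_structured(regions):
    """Hypothetical Stage 2 step: emit JSON-ready records.

    Coordinates and metadata are copied through unchanged, and each record
    gets an 'order' index so downstream consumers keep document context.
    """
    records = []
    for index, region in enumerate(reading_order(regions)):
        record = asdict(region)
        record["order"] = index
        # Slot for a natural-language description of a visual element or a
        # table summary; this sketch only passes through one if present.
        record["description"] = region.meta.get("description")
        records.append(record)
    return records


if __name__ == "__main__":
    sample = [
        Region("table", [50, 400, 500, 600], 1, "| a | b |"),
        Region("text", [50, 100, 500, 150], 1, "Question 1"),
    ]
    print(json.dumps(to_structured(sample), indent=2))
```

Keeping the bounding box and page number on every record is what lets a later consumer reconstruct where a formula or table sat relative to the surrounding text, which is one plausible reading of the "contextual continuity" the summary describes.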
Quick Start & Requirements
Run ocr_stage1.py followed by ocr_stage2.py. Specific installation commands are not provided.
Highlighted Details
Maintenance & Community
The project is described as open and community-driven, with a stated goal of continuous updates. Contact is available via email: ses425500000@gmail.com. No specific community channels or roadmap links are provided.
Licensing & Compatibility
Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). This license requires anyone who distributes a modified version, or lets users interact with one over a network (for example, as a web service), to make the complete corresponding source code available to those users.
Limitations & Caveats
The project lists a "Next-Level AI Pipeline Integration" as "COMING SOON" and states that a beta version is nearing completion, which indicates ongoing development. Specific installation instructions and detailed requirements are absent, which may make setup harder.