Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

created 4 months ago
661 stars

Top 51.7% on sourcepulse

Project Summary

This project provides a multi-modal OCR pipeline for extracting structured data from complex educational documents, such as exam papers, to build machine learning training datasets. It targets students, researchers, and developers who need high-quality datasets covering multilingual text, mathematical formulas, tables, diagrams, and charts, and it produces semantically annotated outputs for model training.

How It Works

The system employs a two-stage workflow. Stage 1 (ocr_stage1.py) performs the initial OCR extraction, using DocLayout-YOLO for layout detection and tools such as Google Vision API and MathPix OCR for content recognition. Stage 2 (ocr_stage2.py) processes these intermediate results into structured, AI-ready outputs (JSON/Markdown), including natural language descriptions of visual elements and summaries of tables. The design aims for high accuracy and preserves context across the document by retaining coordinates and metadata for each extracted element.
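
The two-stage data flow described above can be illustrated with a minimal Python sketch. This is not the project's code: the `Region` record, its field names, and the `stage1_extract`, `stage2_structure`, and `placeholder_describe` functions are all hypothetical, and the layout detector, OCR services, and vision-language model are replaced by plain Python stand-ins. The sketch only shows how coordinates and metadata might be carried from the first stage into a JSON-ready second-stage output.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    """One detected layout region (hypothetical intermediate record)."""
    kind: str              # e.g. "text", "formula", "table", "figure"
    bbox: tuple            # (x0, y0, x1, y1) in page pixels
    page: int
    content: str           # recognized text, LaTeX, or an image reference
    description: str = ""  # natural-language description, filled in stage 2


def stage1_extract(raw_detections):
    """Stand-in for stage 1: wrap detection/OCR results, keeping coordinates."""
    return [
        Region(kind=d["kind"], bbox=tuple(d["bbox"]), page=d["page"], content=d["content"])
        for d in raw_detections
    ]


def placeholder_describe(region):
    """Stand-in for a vision-language model call (assumption, not a real API)."""
    return f"Description pending for {region.kind} '{region.content}'"


def stage2_structure(regions, describe=placeholder_describe):
    """Stand-in for stage 2: order regions top-to-bottom and annotate visuals."""
    ordered = sorted(regions, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))
    for region in ordered:
        if region.kind in ("figure", "table"):
            region.description = describe(region)
    return [asdict(region) for region in ordered]


# Illustrative input: two regions on one page, given out of reading order.
sample = [
    {"kind": "figure", "bbox": [50, 400, 300, 600], "page": 1, "content": "fig_1.png"},
    {"kind": "text", "bbox": [50, 100, 500, 140], "page": 1, "content": "Question 1"},
]
print(json.dumps(stage2_structure(stage1_extract(sample)), indent=2))
```

Sorting by page, then by the top and left edges of the bounding box, is one simple way to approximate reading order; multi-column exam layouts would likely need a more careful ordering rule.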

Quick Start & Requirements

  • Install/Run: The README outlines a two-step Python script execution: ocr_stage1.py followed by ocr_stage2.py. Specific installation commands are not provided.
  • Prerequisites: Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and DocLayout-YOLO are mentioned as core components. Python 3.x is implied. GPU/CUDA requirements are not specified.
  • Resources: Setup time and resource footprint are not detailed.
  • Links: No direct links to quick-start guides or demos are provided.
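
Since the README only names the two scripts, the run order can be sketched as a small Python wrapper. This is an illustration under stated assumptions: the script names come from the README, but the `build_commands` and `run_pipeline` helpers and the `extra_args` parameter are hypothetical, and no command-line flags are assumed because none are documented.

```python
import subprocess
import sys

# Script names as given in the README; run order is stage 1, then stage 2.
STAGE_SCRIPTS = ["ocr_stage1.py", "ocr_stage2.py"]


def build_commands(extra_args=None):
    """Build one command per stage, using the current Python interpreter.

    extra_args is a hypothetical hook: a dict mapping a script name to a
    list of arguments, left empty because the README documents no flags.
    """
    extra_args = extra_args or {}
    return [
        [sys.executable, script, *extra_args.get(script, [])]
        for script in STAGE_SCRIPTS
    ]


def run_pipeline(runner=subprocess.run, extra_args=None):
    """Run both stages in order; stop at the first failing stage."""
    for command in build_commands(extra_args):
        # check=True makes subprocess.run raise CalledProcessError
        # when a stage exits with a non-zero status.
        runner(command, check=True)


# To use, call run_pipeline() from the directory that contains both scripts.
```

Passing `check=True` means stage 2 never runs on incomplete stage 1 output, which matters here because the second stage consumes the first stage's intermediate results.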

Highlighted Details

  • Reports 90-95% accuracy on academic datasets such as EJU Biology and UTokyo Math.
  • Generates English-translated semantic context for figures and tables.
  • Supports Japanese, Korean, and English, with extensibility for other languages.
  • Processes complex layouts with dense scientific content and formulas.

Maintenance & Community

The project is described as open and community-driven, with a stated goal of continuous updates. Contact is available via email: ses425500000@gmail.com. No specific community channels or roadmap links are provided.

Licensing & Compatibility

Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). Anyone who distributes a modified version, or lets users interact with one over a network (for example, as a web service), must make the complete corresponding source code available to those users under the same license.

Limitations & Caveats

The README announces a "Next-Level AI Pipeline Integration" as "COMING SOON" and states that a beta version is nearing completion, which indicates the project is still under active development. Specific installation instructions and detailed requirements are absent, which may increase setup complexity.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 50 stars in the last 90 days
