Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

Created 11 months ago

682 stars

Top 49.7% on SourcePulse

Project Summary

This project provides a multi-modal OCR pipeline for extracting structured data from complex educational documents, such as exam papers, to build machine learning training datasets. It targets students, researchers, and developers who need high-quality datasets covering multilingual text, mathematical formulas, tables, diagrams, and charts, and it produces semantically annotated outputs for model training.

How It Works

The system employs a two-stage workflow. Stage 1 uses ocr_stage1.py for initial OCR extraction, leveraging DocLayout-YOLO for layout detection and tools like Google Vision API and MathPix OCR for content recognition. Stage 2 (ocr_stage2.py) processes these intermediate results to generate structured, AI-ready outputs (JSON/Markdown), including natural language descriptions for visual elements and summaries for tables. This approach aims for high accuracy and preserves contextual continuity through coordinate and metadata retention.
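The hand-off between the two stages can be illustrated with a minimal sketch. It is not the project's code: the `Region` schema, the `stage1` and `stage2` functions, and the `describe` callback are hypothetical stand-ins for the real detector, recognizers, and description model. The sketch only shows the idea of carrying coordinates and metadata through to a structured JSON/Markdown output.

```python
import json
from dataclasses import dataclass, asdict

VISUAL_KINDS = ("figure", "table")  # assumed labels, not the project's


@dataclass
class Region:
    """One detected layout region (hypothetical schema)."""
    kind: str              # e.g. "text", "formula", "table", "figure"
    bbox: tuple            # (x0, y0, x1, y1) in page pixels
    page: int              # page number the region came from
    content: str           # raw recognizer output for this region
    description: str = ""  # filled in by stage 2 for visual elements


def stage1(detections):
    """Stand-in for stage 1: wrap (kind, bbox, page, content) tuples as
    regions, keeping coordinates and page numbers as metadata."""
    return [Region(kind=k, bbox=tuple(b), page=p, content=c)
            for (k, b, p, c) in detections]


def stage2(regions, describe):
    """Stand-in for stage 2: restore reading order from the retained
    coordinates, describe visual elements, and emit JSON and Markdown."""
    ordered = sorted(regions, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))
    lines = []
    for r in ordered:
        if r.kind in VISUAL_KINDS:
            r.description = describe(r)  # natural-language description
            lines.append(f"> [{r.kind}] {r.description}")
        elif r.kind == "formula":
            lines.append(f"$$ {r.content} $$")
        else:
            lines.append(r.content)
    as_json = json.dumps([asdict(r) for r in ordered], ensure_ascii=False)
    return as_json, "\n\n".join(lines)
```

Because each region keeps its bounding box and page number, stage 2 can rebuild reading order without re-running layout detection, which is one plausible reading of the "contextual continuity" the summary mentions.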

Quick Start & Requirements

Install/Run: The README outlines a two-step Python script execution: ocr_stage1.py followed by ocr_stage2.py. Specific installation commands are not provided.
Prerequisites: Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and DocLayout-YOLO are mentioned as core components. Python 3.x is implied. GPU/CUDA requirements are not specified.
Resources: Setup time and resource footprint are not detailed.
Links: No direct links to quick-start guides or demos are provided.
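Since no installation or run commands are documented, the following is only a sketch of how the two scripts might be chained. It assumes both scripts sit at the repository root and take no command-line arguments, neither of which the README summary confirms. `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

# Script names as given in the README summary; run order matters.
STAGES = ("ocr_stage1.py", "ocr_stage2.py")


def build_commands(repo_dir, python=sys.executable):
    """Return one command per stage, in order.
    Assumption: the scripts take no command-line arguments."""
    repo = Path(repo_dir).resolve()
    return [[python, str(repo / name)] for name in STAGES]


def run_pipeline(repo_dir):
    """Run stage 1, then stage 2, stopping at the first failure."""
    for cmd in build_commands(repo_dir):
        script = Path(cmd[1])
        if not script.is_file():
            raise FileNotFoundError(f"missing stage script: {script}")
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so stage 2 never runs on incomplete stage 1 output.
        subprocess.run(cmd, check=True, cwd=script.parent)
```

Calling `run_pipeline("path/to/Versatile-OCR-Program")` would then run both stages in sequence. API credentials for the external services listed under Prerequisites would still need to be configured separately, in whatever way the scripts expect.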

Highlighted Details

Reports 90-95% accuracy on academic datasets such as EJU Biology and UTokyo Math.
Generates English-translated semantic context for figures and tables.
Supports Japanese, Korean, and English, with extensibility for other languages.
Processes complex layouts with dense scientific content and formulas.

Maintenance & Community

The project is described as open and community-driven, with a stated goal of continuous updates. Contact is available via email: ses425500000@gmail.com. No specific community channels or roadmap links are provided.

Licensing & Compatibility

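Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). Under this license, anyone who distributes a modified version, or lets users interact with one over a network (for example, as a web service), must make the complete corresponding source code available to those users under the same license.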

Limitations & Caveats

The README announces a "Next-Level AI Pipeline Integration" as "COMING SOON" and says a beta version is nearing completion, which indicates the project is still under active development. Installation instructions and detailed requirements are absent, which may make setup harder.

Explore Similar Projects

ferrules by AmineDiro

YomiNinja by matt-m-o

Vary-toy by Ucas-HaoranWei

awesome-ocr-resources by ZumingHuang

thepipe by emcf

mPLUG-DocOwl by X-PLUG

deepdoctection by deepdoctection

Ollama-OCR by imanoop7

STranslate by STranslate

manga-image-translator by zyddnys

LunaTranslator by HIllya51

markitdown by microsoft