OCR SDK for AI ingestion of documents with complex layouts
Zerox is an open-source library for Optical Character Recognition (OCR) and document data extraction, designed to process various document formats into structured Markdown using large vision models. It targets developers and researchers needing to ingest complex documents, including those with tables and varied layouts, into AI systems. The primary benefit is leveraging advanced vision models for accurate and context-aware document understanding.
How It Works
Zerox converts input documents (PDF, DOCX, images, etc.) into a series of images. Each image is then processed by a selected vision model (e.g., GPT-4o, Gemini) with specific prompts to extract content as Markdown. The library supports multiple LLM providers, including OpenAI, Azure OpenAI, AWS Bedrock, and Google Gemini. An optional maintainFormat feature passes context from previous pages to subsequent requests, improving consistency for documents with cross-page tables or complex formatting, at the cost of slower processing.
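The trade-off described above can be sketched in a few lines. This is an illustrative model of the page-by-page flow, not Zerox's actual implementation: `pages_to_markdown`, `stub_model`, and the `VisionModel` signature are hypothetical names, and the stub stands in for a real vision model request. It assumes that pages without shared context can be requested concurrently, while carrying context forward forces the requests to run one after another.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical signature: (page_image, prior_markdown) -> Markdown for that page.
VisionModel = Callable[[bytes, Optional[str]], Awaitable[str]]


async def pages_to_markdown(
    pages: List[bytes],
    model: VisionModel,
    maintain_format: bool = False,
) -> List[str]:
    """Convert page images to Markdown, one model request per page."""
    if not maintain_format:
        # Pages are independent, so all requests can run concurrently.
        return list(await asyncio.gather(*(model(page, None) for page in pages)))

    # Each request needs the previous page's output, so pages run one at a time.
    results: List[str] = []
    previous: Optional[str] = None
    for page in pages:
        previous = await model(page, previous)
        results.append(previous)
    return results


async def stub_model(page: bytes, prior: Optional[str]) -> str:
    """Stand-in for a real vision model call; marks pages that received context."""
    suffix = " (with context)" if prior is not None else ""
    return f"# {page.decode()}{suffix}"


if __name__ == "__main__":
    demo_pages = [b"page-1", b"page-2"]
    print(asyncio.run(pages_to_markdown(demo_pages, stub_model)))
    print(asyncio.run(pages_to_markdown(demo_pages, stub_model, maintain_format=True)))
```

In the sequential branch the first page receives no context and every later page receives the Markdown of the page before it, which is why a cross-page table can keep its column layout but the total latency grows with the page count.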
Quick Start & Requirements
Node.js: npm install zerox
Requires graphicsmagick and ghostscript for PDF processing.
Python: pip install py-zerox
Requires poppler for PDF processing.
Highlighted Details
correctOrientation, trimEdges, and maintainFormat enhance output quality and consistency.
Maintenance & Community
The project is associated with getomni-ai. The README gives no further community or maintenance details.
Licensing & Compatibility
Limitations & Caveats
The Python SDK does not support features like data extraction schema, per-page extraction, or orientation correction, which are available in the Node.js version. Some advanced features may be platform-dependent or require specific system dependencies.
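Because PDF processing depends on external tools (graphicsmagick and ghostscript for the Node.js package, poppler for the Python package), a quick pre-flight check can save a confusing runtime failure. The sketch below is not part of Zerox; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names. It assumes the usual executable names on Unix-like systems: `gm` for GraphicsMagick, `gs` for Ghostscript, and `pdftoppm` as a representative Poppler binary. On Windows the names differ (Ghostscript, for example, is typically installed as `gswin64c`).

```python
import shutil
from typing import Callable, Dict, List, Optional

# Assumed executable names on Unix-like systems; adjust for your platform.
REQUIRED_TOOLS: Dict[str, List[str]] = {
    "node": ["gm", "gs"],    # GraphicsMagick, Ghostscript
    "python": ["pdftoppm"],  # shipped with Poppler
}


def missing_tools(
    sdk: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools for the given SDK that are not found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[sdk] if which(tool) is None]


if __name__ == "__main__":
    for sdk in REQUIRED_TOOLS:
        missing = missing_tools(sdk)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{sdk}: {status}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` searches the current PATH.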