zerox by getomni-ai

OCR SDK for AI ingestion of documents with complex layouts

created 1 year ago
11,620 stars

Top 4.4% on sourcepulse

Project Summary

Zerox is an open-source library for Optical Character Recognition (OCR) and document data extraction, designed to process various document formats into structured Markdown using large vision models. It targets developers and researchers needing to ingest complex documents, including those with tables and varied layouts, into AI systems. The primary benefit is leveraging advanced vision models for accurate and context-aware document understanding.

How It Works

Zerox converts input documents (PDF, DOCX, images, etc.) into a series of images. Each image is then processed by a selected vision model (e.g., GPT-4o, Gemini) with specific prompts to extract content as Markdown. The library supports multiple LLM providers, including OpenAI, Azure OpenAI, AWS Bedrock, and Google Gemini. An optional maintainFormat feature allows context from previous pages to be passed to subsequent requests, improving consistency for documents with cross-page tables or complex formatting, albeit at the cost of slower processing.
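The per-page flow described above can be sketched in a few lines. This is a minimal illustration, not Zerox's implementation: `pages_to_markdown`, `call_model`, and `fake_model` are hypothetical names, and the model call is replaced by a stand-in so the example runs without any API key.

```python
from typing import Callable, List, Optional

# Hypothetical model signature: (page_image, prior_markdown_or_None) -> markdown
ModelFn = Callable[[bytes, Optional[str]], str]


def pages_to_markdown(
    page_images: List[bytes],
    call_model: ModelFn,
    maintain_format: bool = False,
) -> List[str]:
    """Convert page images to Markdown, one model request per page."""
    results: List[str] = []
    prior: Optional[str] = None
    for image in page_images:
        # With maintain_format, the previous page's Markdown is sent as context.
        context = prior if maintain_format else None
        markdown = call_model(image, context)
        results.append(markdown)
        prior = markdown
    return results


def fake_model(image: bytes, prior: Optional[str]) -> str:
    """Stand-in for a vision model call; reports whether context was received."""
    tag = "with-context" if prior is not None else "no-context"
    return f"page:{image.decode()}:{tag}"


if __name__ == "__main__":
    print(pages_to_markdown([b"1", b"2"], fake_model, maintain_format=True))
```

In this sketch, enabling `maintain_format` makes each request depend on the previous page's output, so the pages cannot be dispatched concurrently. That dependency is one plausible reason for the slower processing mentioned above.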

Quick Start & Requirements

  • Node.js: npm install zerox
    • Requires graphicsmagick and ghostscript for PDF processing.
  • Python: pip install py-zerox
    • Requires poppler for PDF processing.
  • LLM API Keys: Required for chosen model providers.
  • Documentation: https://docs.getomni.ai/zerox
  • Demo: https://getomni.ai/ocr-demo
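Before a first run, it can help to confirm that the system dependencies and API key listed above are present. The following is a hedged preflight sketch that does not call Zerox. The executable names (`gm` for graphicsmagick, `gs` for ghostscript, `pdftoppm` for poppler) are assumptions about typical installs and may differ by platform, and `OPENAI_API_KEY` is only an example variable name.

```python
import os
import shutil
from typing import Callable, Dict, List, Mapping, Optional

# Assumed executable names; adjust for your platform or package manager.
REQUIRED_TOOLS: Dict[str, List[str]] = {
    "node": ["gm", "gs"],      # graphicsmagick, ghostscript
    "python": ["pdftoppm"],    # poppler
}


def missing_tools(
    sdk: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[sdk] if which(tool) is None]


def missing_env(
    names: List[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    print("Missing tools:", missing_tools("python"))
    print("Missing env vars:", missing_env(["OPENAI_API_KEY"]))
```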

Highlighted Details

  • Supports a wide array of LLM providers and models for OCR and data extraction.
  • Offers structured data extraction via JSON schema.
  • Handles numerous file types by converting them to PDF first (e.g., DOCX, XLSX, PPTX).
  • Features like correctOrientation, trimEdges, and maintainFormat enhance output quality and consistency.
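To illustrate what extraction against a JSON schema means in practice, here is a library-independent sketch that checks a model's JSON output against a small schema. It does not use Zerox's API. The invoice schema, its field names, and `check_extraction` are hypothetical, and the checker covers only `required` keys and primitive `type` values rather than the full JSON Schema specification.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema; field names are illustrative only.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value: Any, type_name: str) -> bool:
    # bool is a subclass of int in Python, so exclude it from "number".
    if type_name == "number" and isinstance(value, bool):
        return False
    return isinstance(value, _PY_TYPES[type_name])


def check_extraction(raw_json: str, schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload passed."""
    data = json.loads(raw_json)
    problems: List[str] = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in data and not _matches(data[key], spec["type"]):
            problems.append(f"wrong type: {key}")
    return problems


if __name__ == "__main__":
    print(check_extraction('{"invoice_number": "A-1", "total": 12.5}', INVOICE_SCHEMA))
```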

Maintenance & Community

The project is associated with getomni-ai. The README gives no further community or maintenance details.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The Python SDK lacks several features available in the Node.js version, including schema-based data extraction, per-page extraction, and orientation correction. PDF processing also depends on system packages: graphicsmagick and ghostscript for Node.js, poppler for Python.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 580 stars in the last 90 days
