zerox by getomni-ai

OCR SDK for AI ingestion of documents with complex layouts

Created 1 year ago

12,144 stars

Top 4.2% on SourcePulse

5 Experts Love This Project

Eric Ciarla

Cofounder of Firecrawl

John Philip Morgan

Cofounder of Jasper

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Pawel Garbacki

Cofounder of Fireworks AI

and 1 more!

Project Summary

Zerox is an open-source library for Optical Character Recognition (OCR) and document data extraction, designed to process various document formats into structured Markdown using large vision models. It targets developers and researchers needing to ingest complex documents, including those with tables and varied layouts, into AI systems. The primary benefit is leveraging advanced vision models for accurate and context-aware document understanding.

How It Works

Zerox converts input documents (PDF, DOCX, images, etc.) into a series of images. Each page image is then sent to the selected vision model (e.g., GPT-4o, Gemini) with a prompt that asks for the page content as Markdown. The library supports multiple LLM providers, including OpenAI, Azure OpenAI, AWS Bedrock, and Google Gemini. An optional maintainFormat feature passes context from previous pages to subsequent requests, which improves consistency for documents with cross-page tables or complex formatting at the cost of slower processing.
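The page-by-page flow described above can be sketched in a few lines. This is a minimal illustration, not Zerox's implementation: `pages_to_markdown`, `PageModel`, and `fake_model` are hypothetical names, and the stub stands in for a real vision-model call.

```python
from typing import Callable, List, Optional

# Hypothetical model signature: one page image plus optional prior Markdown
# goes in, Markdown for that page comes out.
PageModel = Callable[[bytes, Optional[str]], str]


def pages_to_markdown(
    page_images: List[bytes],
    model: PageModel,
    maintain_format: bool = False,
) -> str:
    """Convert page images to Markdown, optionally carrying context forward."""
    outputs: List[str] = []
    previous: Optional[str] = None
    for image in page_images:
        # Only pass the previous page's Markdown when maintain_format is on.
        context = previous if maintain_format else None
        markdown = model(image, context)
        outputs.append(markdown)
        previous = markdown
    return "\n\n".join(outputs)


def fake_model(image: bytes, prior: Optional[str]) -> str:
    # Stand-in for a vision-model request; reports whether context was given.
    tag = "with context" if prior is not None else "no context"
    return f"page {image.decode()} ({tag})"


print(pages_to_markdown([b"1", b"2"], fake_model, maintain_format=True))
```

In this sketch each request needs the previous page's output, so the pages have to be handled one after another. That ordering constraint is one plausible reason for the slower processing noted above.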

Quick Start & Requirements

Node.js: npm install zerox
- Requires graphicsmagick and ghostscript for PDF processing.
Python: pip install py-zerox
- Requires poppler for PDF processing.
LLM API Keys: Required for chosen model providers.
Documentation: https://docs.getomni.ai/zerox
Demo: https://getomni.ai/ocr-demo
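Before installing either SDK, it can help to confirm that the system packages listed above are on the PATH. The sketch below uses only the Python standard library. The executable names (`gm`, `gs`, `pdftoppm`) are assumptions about how graphicsmagick, ghostscript, and poppler are commonly installed, and they may differ by platform.

```python
import shutil
from typing import Dict, List

# Assumed executable names; these are not taken from the Zerox docs and
# can vary by operating system or package manager.
ASSUMED_BINARIES: Dict[str, List[str]] = {
    "node": ["gm", "gs"],     # graphicsmagick, ghostscript
    "python": ["pdftoppm"],   # shipped with poppler
}


def missing_binaries(sdk: str) -> List[str]:
    """Return the assumed executables for the given SDK that are not on PATH."""
    return [name for name in ASSUMED_BINARIES[sdk] if shutil.which(name) is None]


for sdk in ASSUMED_BINARIES:
    missing = missing_binaries(sdk)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{sdk}: {status}")
```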

Highlighted Details

Supports a wide array of LLM providers and models for OCR and data extraction.
Offers structured data extraction via JSON schema.
Handles numerous file types by converting them to PDF first (e.g., DOCX, XLSX, PPTX).
Features like correctOrientation, trimEdges, and maintainFormat enhance output quality and consistency.
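To make the JSON schema idea concrete, here is a small, hypothetical schema and a check on an extracted result. It illustrates the concept only and does not call Zerox. The invoice fields and the `missing_required` helper are invented for this example, and how a schema is passed to the library is not shown here.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema for an invoice; field names are illustrative only.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}


def missing_required(schema: Dict[str, Any], data: Dict[str, Any]) -> List[str]:
    """List required top-level fields that are absent from the extracted data."""
    return [key for key in schema.get("required", []) if key not in data]


# Pretend this JSON came back from an extraction run.
extracted = json.loads('{"invoice_number": "INV-001"}')
print(missing_required(INVOICE_SCHEMA, extracted))
```

A full validator would also check value types and nested objects. This helper only looks at required top-level keys.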

Maintenance & Community

The project is associated with getomni-ai. The README does not describe community channels or maintenance practices further.

Licensing & Compatibility

License: MIT.
Compatibility: The permissive MIT license allows commercial use and integration into closed-source projects.

Limitations & Caveats

The Python SDK lacks several features that are available in the Node.js version, including data extraction schemas, per-page extraction, and orientation correction. Some features also depend on system packages, such as graphicsmagick and ghostscript (Node.js) or poppler (Python) for PDF processing.

Health Check

Last Commit

9 months ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

125 stars in the last 30 days