OCR benchmark for multimodal models
This project provides an open-source benchmark for evaluating Optical Character Recognition (OCR) and data extraction capabilities of Large Multimodal Models (LMMs) and traditional OCR providers. It targets researchers and developers aiming to compare model performance on document processing tasks, particularly JSON extraction accuracy, with transparent methodologies and datasets.
How It Works
The benchmark follows a Document ⇒ OCR ⇒ Extraction pipeline: it measures how well models perform OCR on documents and then extract structured data (specifically JSON) from the OCR'd text. Accuracy is scored with a modified JSON diff, computed as 1 − (number of differing fields / total fields). Levenshtein distance is also reported for text similarity, though that metric is sensitive to minor layout variations.
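For concreteness, here is a minimal TypeScript sketch of that accuracy formula, assuming a flat field-by-field comparison; the flatten helper, field names, and sample objects are illustrative and not the benchmark's actual diff implementation.

    // Sketch of the JSON-diff accuracy idea: accuracy = 1 - (differing fields / total fields).
    // flatten() and the sample objects are illustrative; the project's real diff logic may differ.
    type Json = { [key: string]: unknown };

    // Flatten nested objects into dot-separated paths so each leaf counts as one field.
    function flatten(obj: Json, prefix = ""): Map<string, unknown> {
      const out = new Map<string, unknown>();
      for (const [key, value] of Object.entries(obj)) {
        const path = prefix ? `${prefix}.${key}` : key;
        if (value !== null && typeof value === "object" && !Array.isArray(value)) {
          for (const [p, v] of flatten(value as Json, path)) out.set(p, v);
        } else {
          out.set(path, value);
        }
      }
      return out;
    }

    // accuracy = 1 - (number of differing fields / total fields in the ground truth)
    function jsonAccuracy(groundTruth: Json, predicted: Json): number {
      const truth = flatten(groundTruth);
      const pred = flatten(predicted);
      let differing = 0;
      for (const [path, value] of truth) {
        if (JSON.stringify(pred.get(path)) !== JSON.stringify(value)) differing++;
      }
      return truth.size === 0 ? 1 : 1 - differing / truth.size;
    }

    // Example: one wrong field out of four leaves accuracy at 0.75.
    const truth = { invoice: { total: 100, currency: "USD" }, vendor: "Acme", date: "2024-01-01" };
    const pred = { invoice: { total: 100, currency: "USD" }, vendor: "Acme Inc", date: "2024-01-01" };
    console.log(jsonAccuracy(truth, pred)); // 0.75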
Quick Start & Requirements
1. Run npm install.
2. Add test documents to data/, or set DATABASE_URL in .env.
3. Copy models.example.yaml to models.yaml and set API keys in .env (a sketch of both files follows these steps).
4. Run npm run benchmark.
5. Results are written to results/<timestamp>/results.json.
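The exact contents of models.example.yaml and .env depend on the providers you enable; the snippet below is only a hypothetical illustration of the copy-and-fill-in step, with invented model names and variable names. Mirror the real schema from models.example.yaml.

    # models.yaml (hypothetical example)
    models:
      - ocr: gpt-4o            # model used for the OCR step
        extraction: gpt-4o     # model used for JSON extraction

    # .env (hypothetical variable names; use whatever keys your configured providers require)
    OPENAI_API_KEY=sk-...
    DATABASE_URL=postgres://user:password@localhost:5432/benchmark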
Highlighted Details
Maintenance & Community
The project is maintained by OmniAI. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The Levenshtein distance metric is sensitive to minor text layout changes, so accurate extractions that do not match the ground truth's formatting exactly can be penalized; the sketch below illustrates this. JSON extraction is also not supported for every listed open-source LLM or cloud OCR provider.
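To see why layout alone can hurt the text-similarity score, here is a small TypeScript illustration; the normalization 1 − distance / max length is an assumed convention rather than the project's exact formula.

    // Illustration of Levenshtein sensitivity to layout-only changes.
    // Standard single-row dynamic-programming edit distance.
    function levenshtein(a: string, b: string): number {
      const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
      for (let i = 1; i <= a.length; i++) {
        let prev = dp[0];
        dp[0] = i;
        for (let j = 1; j <= b.length; j++) {
          const tmp = dp[j];
          dp[j] = Math.min(
            dp[j] + 1,                              // deletion
            dp[j - 1] + 1,                          // insertion
            prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
          );
          prev = tmp;
        }
      }
      return dp[b.length];
    }

    // Same fields, different reading order: no content is lost, yet the edit distance is large.
    const groundTruth = "Name: Acme Corp\nTotal: $100.00";
    const reordered = "Total: $100.00\nName: Acme Corp";
    const distance = levenshtein(groundTruth, reordered);
    const similarity = 1 - distance / Math.max(groundTruth.length, reordered.length);
    console.log(distance, similarity.toFixed(3)); // low similarity despite identical content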