OCR benchmark for multimodal models
This project provides an open-source benchmark for evaluating Optical Character Recognition (OCR) and data extraction capabilities of Large Multimodal Models (LMMs) and traditional OCR providers. It targets researchers and developers aiming to compare model performance on document processing tasks, particularly JSON extraction accuracy, with transparent methodologies and datasets.
How It Works
The benchmark follows a Document ⇒ OCR ⇒ Extraction pipeline: it measures how well models perform OCR on documents and then extract structured data (specifically JSON) from the OCR'd text. Accuracy is scored with a modified JSON diff, computed as 1 − (number of differing fields / total fields). Levenshtein distance is also reported for text similarity, though that metric is sensitive to minor layout variations.
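For concreteness, here is a minimal TypeScript sketch of that accuracy formula, assuming a flat field-by-field comparison; the flatten helper, field names, and sample objects are illustrative and not the benchmark's actual diff implementation.

    // Sketch of the JSON-diff accuracy idea: accuracy = 1 - (differing fields / total fields).
    // flatten() and the sample objects are illustrative; the project's real diff logic may differ.
    type Json = { [key: string]: unknown };

    // Flatten nested objects into dot-separated paths so each leaf counts as one field.
    function flatten(obj: Json, prefix = ""): Map<string, unknown> {
      const out = new Map<string, unknown>();
      for (const [key, value] of Object.entries(obj)) {
        const path = prefix ? `${prefix}.${key}` : key;
        if (value !== null && typeof value === "object" && !Array.isArray(value)) {
          for (const [p, v] of flatten(value as Json, path)) out.set(p, v);
        } else {
          out.set(path, value);
        }
      }
      return out;
    }

    // accuracy = 1 - (number of differing fields / total fields in the ground truth)
    function jsonAccuracy(groundTruth: Json, predicted: Json): number {
      const truth = flatten(groundTruth);
      const pred = flatten(predicted);
      let differing = 0;
      for (const [path, value] of truth) {
        if (JSON.stringify(pred.get(path)) !== JSON.stringify(value)) differing++;
      }
      return truth.size === 0 ? 1 : 1 - differing / truth.size;
    }

    // Example: one wrong field out of four leaves accuracy at 0.75.
    const truth = { invoice: { total: 100, currency: "USD" }, vendor: "Acme", date: "2024-01-01" };
    const pred = { invoice: { total: 100, currency: "USD" }, vendor: "Acme Inc", date: "2024-01-01" };
    console.log(jsonAccuracy(truth, pred)); // 0.75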
Quick Start & Requirements
1. Run npm install.
2. Add test documents to data/, or set DATABASE_URL in .env.
3. Copy models.example.yaml to models.yaml and set API keys in .env (a sketch of both files follows these steps).
4. Run npm run benchmark.
5. Results are written to results/<timestamp>/results.json.
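The exact contents of models.example.yaml and .env depend on the providers you enable; the snippet below is only a hypothetical illustration of the copy-and-fill-in step, with invented model names and variable names. Mirror the real schema from models.example.yaml.

    # models.yaml (hypothetical example)
    models:
      - ocr: gpt-4o            # model used for the OCR step
        extraction: gpt-4o     # model used for JSON extraction

    # .env (hypothetical variable names; use whatever keys your configured providers require)
    OPENAI_API_KEY=sk-...
    DATABASE_URL=postgres://user:password@localhost:5432/benchmark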
Highlighted Details
Maintenance & Community
The project is maintained by OmniAI. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The Levenshtein distance metric is sensitive to minor text layout changes, so accurate extractions that do not match the ground truth's formatting exactly can be penalized; the sketch below illustrates this. JSON extraction is also not supported for every listed open-source LLM or cloud OCR provider.
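To see why layout alone can hurt the text-similarity score, here is a small TypeScript illustration; the normalization 1 − distance / max length is an assumed convention rather than the project's exact formula.

    // Illustration of Levenshtein sensitivity to layout-only changes.
    // Standard single-row dynamic-programming edit distance.
    function levenshtein(a: string, b: string): number {
      const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
      for (let i = 1; i <= a.length; i++) {
        let prev = dp[0];
        dp[0] = i;
        for (let j = 1; j <= b.length; j++) {
          const tmp = dp[j];
          dp[j] = Math.min(
            dp[j] + 1,                              // deletion
            dp[j - 1] + 1,                          // insertion
            prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
          );
          prev = tmp;
        }
      }
      return dp[b.length];
    }

    // Same fields, different reading order: no content is lost, yet the edit distance is large.
    const groundTruth = "Name: Acme Corp\nTotal: $100.00";
    const reordered = "Total: $100.00\nName: Acme Corp";
    const distance = levenshtein(groundTruth, reordered);
    const similarity = 1 - distance / Math.max(groundTruth.length, reordered.length);
    console.log(distance, similarity.toFixed(3)); // low similarity despite identical content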