Benchmark for document parsing and evaluation in real-world scenarios
OmniDocBench is a comprehensive benchmark dataset and evaluation suite for document parsing tasks, targeting researchers and developers in document AI. It addresses the need for robust evaluation across diverse document types and parsing challenges, offering rich annotations for layout detection, text OCR, formula recognition, and table understanding.
How It Works
OmniDocBench provides 981 PDF pages with extensive annotations covering 15 block-level and 4 span-level document elements. It supports both end-to-end evaluation and module-specific assessment, using metrics such as Normalized Edit Distance, BLEU, METEOR, TEDS, and COCODet (COCO-style detection metrics). Its strength lies in detailed attribute annotations at the page and block levels, plus flexible evaluation configurations that enable fine-grained analysis of model performance across document characteristics.
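The primary text metric, Normalized Edit Distance, can be sketched as below. This is a minimal illustration of the metric's definition (Levenshtein distance normalized by the longer string's length), not the benchmark's actual implementation; the function name is hypothetical.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length.

    0.0 means identical strings; 1.0 means maximally different.
    A minimal sketch -- OmniDocBench's own implementation may differ.
    """
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Standard dynamic-programming Levenshtein distance, row by row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits over 7 chars
```

Lower is better: a model whose page-level markdown output differs from the ground truth by few character edits scores close to 0.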
Quick Start & Requirements
pip install -r requirements.txt
python pdf_validation.py --config <config_path>
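Evaluations are driven by a YAML config passed via `--config`. The fragment below is a hedged sketch of what such a file might contain; the key names and structure are illustrative assumptions, not verbatim from the repository.

```yaml
# Hypothetical evaluation config -- key names are assumptions for illustration.
end2end_eval:
  metrics:
    text_block:
      - Edit_dist        # Normalized Edit Distance on text blocks
    table:
      - TEDS             # tree edit distance similarity for tables
  match_method: quick_match   # one of: no_split, simple_match, quick_match
```

Consult the repository's bundled example configs for the authoritative schema.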
Highlighted Details
Evaluation supports three configurable strategies for matching model outputs against ground truth: no_split, simple_match, and quick_match.
Maintenance & Community
The project has seen recent updates (March 2025), including new model evaluations and acceptance to CVPR 2025. Feedback is welcome via GitHub issues, and pull requests are accepted.
Licensing & Compatibility
The dataset is for research purposes only and not for commercial use. Copyright concerns should be directed to OpenDataLab@pjlab.org.cn.
Limitations & Caveats
Text evaluation currently only supports English and Simplified Chinese; Unicode mapping for special characters is planned. Some models may produce non-standard outputs requiring post-processing for accurate matching.
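Such post-processing might look like the sketch below, which strips a wrapping code fence and collapses extra blank lines from a model's markdown output before matching. The function name and the specific normalizations are assumptions for illustration; the cleanup OmniDocBench actually needs is model-specific.

```python
import re

def strip_model_wrappers(output: str) -> str:
    """Normalize non-standard model output before metric matching.

    A hedged sketch: real post-processing depends on the model's quirks.
    """
    text = output.strip()
    # Drop a surrounding ``` code fence, if the whole output is wrapped in one.
    fence = re.match(r"^```[a-zA-Z]*\n(.*)\n```$", text, flags=re.DOTALL)
    if fence:
        text = fence.group(1)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(strip_model_wrappers("```markdown\n# Title\n\n\n\nBody\n```"))
```

Without such normalization, a wrapper fence alone can inflate a model's edit distance even when the parsed content is correct.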