OmniDocBench by opendatalab

Benchmark for document parsing and evaluation in real-world scenarios

Created 1 year ago

1,513 stars

Project Summary

OmniDocBench is a comprehensive benchmark dataset and evaluation suite for document parsing tasks, targeting researchers and developers in document AI. It addresses the need for robust evaluation across diverse document types and parsing challenges, offering rich annotations for layout detection, text OCR, formula recognition, and table understanding.

How It Works

OmniDocBench provides 981 PDF pages annotated with 15 block-level and 4 span-level document element types. It supports both end-to-end evaluation and module-specific assessment, using metrics such as Normalized Edit Distance, BLEU, METEOR, TEDS, and COCODet. Its main strength is the attribute annotation at the page and block levels, combined with flexible evaluation configurations, which lets model performance be broken down by document characteristic.
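To make the first metric concrete, here is a minimal sketch of a normalized edit distance between a predicted string and a reference string. It assumes the common convention of dividing the Levenshtein distance by the length of the longer string; the benchmark's own implementation and normalization may differ, and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as a perfect match
    return levenshtein(prediction, reference) / longest
```

Lower values are better under this convention: a prediction identical to the reference scores 0.0, and a prediction sharing nothing with it scores 1.0.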

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: Python 3.10, LaTeXML (for LaTeX table conversion).
Dataset: Download from Hugging Face or OpenDataLab.
Evaluation: Run python pdf_validation.py --config <config_path>
Docs: Dataset (🤗Hugging Face), Dataset (OpenDataLab), arXiv
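For scripted runs, the evaluation step above can be wrapped in a small helper. This is a sketch only: it assembles the command shown in the Quick Start, and the config filename used in the usage note is a hypothetical placeholder, not a file this summary confirms.

```python
import sys


def build_eval_command(config_path: str) -> list[str]:
    """Return the argv list for `python pdf_validation.py --config <config_path>`.

    Uses the current interpreter so the call runs in the same environment
    where requirements.txt was installed.
    """
    return [sys.executable, "pdf_validation.py", "--config", config_path]
```

From the repository root, the result can be passed to `subprocess.run(build_eval_command("my_config.yaml"), check=True)`, where `my_config.yaml` stands in for whichever evaluation config you are using.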

Highlighted Details

Supports end-to-end evaluation with multiple matching methods (no_split, simple_match, quick_match).
Includes detailed evaluation for text OCR, formula recognition (display and inline), table recognition (HTML/LaTeX), and layout detection.
Benchmarks a wide range of models, including recent VLMs such as GPT-4o and Qwen2.5-VL.
Offers fine-grained evaluation by page and by attribute, making it easier to pinpoint where a model fails.
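The attribute-level breakdown in the last point can be pictured as grouping per-page scores by an annotated attribute and averaging each group. The sketch below assumes a hypothetical record layout (an `attributes` dict and a `score` field per page) with invented values; the benchmark's actual annotation schema and aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def scores_by_attribute(pages: list[dict], attribute: str) -> dict[str, float]:
    """Average the per-page score for each value of one page attribute."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["attributes"][attribute]].append(page["score"])
    return {value: mean(scores) for value, scores in groups.items()}


# Hypothetical per-page results (lower is better, as with an edit-distance metric).
pages = [
    {"attributes": {"language": "english", "layout": "single_column"}, "score": 0.25},
    {"attributes": {"language": "english", "layout": "double_column"}, "score": 0.75},
    {"attributes": {"language": "simplified_chinese", "layout": "single_column"}, "score": 0.5},
]

by_language = scores_by_attribute(pages, "language")
by_layout = scores_by_attribute(pages, "layout")
```

Comparing the groups side by side shows which document characteristics a model handles poorly, which a single aggregate score would hide.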

Maintenance & Community

The project was updated in March 2025 with new model evaluations, and the paper was accepted to CVPR 2025. Feedback and PRs are welcome on GitHub.

Licensing & Compatibility

The dataset is for research purposes only and not for commercial use. Copyright concerns should be directed to OpenDataLab@pjlab.org.cn.

Limitations & Caveats

Text evaluation currently supports only English and Simplified Chinese; Unicode mapping for special characters is planned. Some models produce non-standard output that requires post-processing before it can be matched accurately.
