OmniDocBench  by opendatalab

Benchmark for document parsing and evaluation in real-world scenarios

created 9 months ago
636 stars

Top 53.1% on sourcepulse

Project Summary

OmniDocBench is a comprehensive benchmark dataset and evaluation suite for document parsing tasks, targeting researchers and developers in document AI. It addresses the need for robust evaluation across diverse document types and parsing challenges, offering rich annotations for layout detection, text OCR, formula recognition, and table understanding.

How It Works

OmniDocBench provides 981 PDF pages with extensive annotations, including 15 block-level and 4 span-level document elements. It supports end-to-end evaluation and module-specific assessments, utilizing metrics like Normalized Edit Distance, BLEU, METEOR, TEDS, and COCODet. The benchmark's strength lies in its detailed attribute annotations (page and block levels) and flexible evaluation configurations, allowing for fine-grained analysis of model performance across various document characteristics.
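As an illustration of the first metric named above, here is a minimal sketch of a normalized edit distance between a predicted string and a reference string. It assumes the Levenshtein distance is divided by the length of the longer string; the benchmark's own normalization and text preprocessing may differ. The function names are hypothetical and are not taken from the OmniDocBench code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0  # two empty strings are treated as a perfect match
    return levenshtein(pred, ref) / longest
```

Lower is better under this definition: a prediction identical to the reference scores 0.0, and a prediction sharing nothing with the reference scores 1.0.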

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10, LaTeXML (for LaTeX table conversion).
  • Dataset: Download from Hugging Face or OpenDataLab.
  • Evaluation: Run python pdf_validation.py --config <config_path>
  • Docs: Dataset (🤗Hugging Face), Dataset (OpenDataLab), arXiv

Highlighted Details

  • Supports end-to-end evaluation with multiple matching methods (no_split, simple_match, quick_match).
  • Includes detailed evaluation for text OCR, formula recognition (display and inline), table recognition (HTML/LaTeX), and layout detection.
  • Benchmarks a wide range of models, including recent VLMs such as GPT-4o and Qwen2.5-VL.
  • Offers fine-grained evaluation by page and attribute, enabling precise pain point identification.
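The attribute-level breakdown mentioned in the last bullet can be pictured as grouping per-page scores by an annotated attribute. The sketch below is illustrative only: the record layout, the attribute names (`language`, `layout`), and the function `mean_score_by_attribute` are hypothetical and do not reflect the benchmark's actual annotation schema or evaluation code.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_attribute(pages: list[dict], attribute: str) -> dict[str, float]:
    """Average a per-page score within each value of one page attribute."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for page in pages:
        buckets[page["attributes"][attribute]].append(page["score"])
    return {value: mean(scores) for value, scores in buckets.items()}


# Hypothetical per-page results; the scores stand in for any per-page metric.
pages = [
    {"attributes": {"language": "english", "layout": "single_column"}, "score": 0.10},
    {"attributes": {"language": "english", "layout": "double_column"}, "score": 0.30},
    {"attributes": {"language": "simplified_chinese", "layout": "single_column"}, "score": 0.20},
]

by_language = mean_score_by_attribute(pages, "language")
by_layout = mean_score_by_attribute(pages, "layout")
```

Comparing the averages across values of one attribute shows where a model is weakest, for example on a particular language or layout type.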

Maintenance & Community

The project has seen recent updates (March 2025), including new model evaluations, and was accepted to CVPR 2025. Feedback is welcome via GitHub issues, and PRs are also welcome.

Licensing & Compatibility

The dataset is for research purposes only and not for commercial use. Copyright concerns should be directed to OpenDataLab@pjlab.org.cn.

Limitations & Caveats

Text evaluation currently only supports English and Simplified Chinese; Unicode mapping for special characters is planned. Some models may produce non-standard outputs requiring post-processing for accurate matching.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 6
  • Issues (30d): 17
  • Star history: 243 stars in the last 90 days
