Layout analysis dataset for document understanding tasks
DocBank provides a large-scale, fine-grained dataset for document layout analysis, enabling models to integrate textual and layout information. It targets researchers and practitioners in NLP and computer vision working on document understanding tasks, offering a robust benchmark for evaluating layout analysis models.
How It Works
DocBank uses weak supervision to generate fine-grained, token-level annotations for 12 semantic document units across 500,000 pages. Producing labels this way is far cheaper than manual annotation at this scale, which makes the dataset practical for training models that need detailed structural information. Because every token carries both a label and a position, the dataset can be used for text-based sequence labeling as well as image-based object detection.
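To illustrate how one token-level annotation can serve both approaches, here is a minimal sketch. It assumes each annotation line holds ten tab-separated fields (token, x0, y0, x1, y1, R, G, B, font name, label); that layout is an assumption not stated in this summary, and the sample lines, the font name, and the helper names `parse_line` and `label_regions` are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1)
    label: str


def parse_line(line: str) -> Token:
    # Assumed layout: token, x0, y0, x1, y1, R, G, B, font, label (tab-separated).
    fields = line.rstrip("\n").split("\t")
    x0, y0, x1, y1 = (int(value) for value in fields[1:5])
    return Token(fields[0], (x0, y0, x1, y1), fields[-1])


def label_regions(tokens):
    """Union the boxes of all tokens that share a label into one box per label."""
    regions = {}
    for tok in tokens:
        x0, y0, x1, y1 = tok.box
        if tok.label in regions:
            a, b, c, d = regions[tok.label]
            regions[tok.label] = (min(a, x0), min(b, y0), max(c, x1), max(d, y1))
        else:
            regions[tok.label] = tok.box
    return regions


# Hypothetical sample lines, made up for illustration only.
sample = [
    "DocBank\t100\t50\t180\t70\t0\t0\t0\tHypotheticalFont\ttitle",
    "dataset\t190\t50\t260\t70\t0\t0\t0\tHypotheticalFont\ttitle",
    "We\t100\t120\t120\t135\t0\t0\t0\tHypotheticalFont\tparagraph",
]

tokens = [parse_line(line) for line in sample]

# Sequence-labeling view: (token, label) pairs in reading order.
pairs = [(tok.text, tok.label) for tok in tokens]

# Detection-style view: one bounding box per label.
regions = label_regions(tokens)
```

Merging by label alone is deliberately crude: a real page usually contains several paragraphs or captions, so a practical converter would first group adjacent tokens into separate regions.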
Quick Start & Requirements
indexed_files directory.
scripts/pdf_process.py is provided to convert PDF files to the DocBank format:
cd scripts
python pdf_process.py --data_dir /path/to/pdf/directory --output_dir /path/to/data/output/directory
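The same conversion can be driven from Python, for example inside a larger preprocessing pipeline. The sketch below only wraps the command shown above; the helper names `build_command` and `run_conversion` are hypothetical, and it assumes the script accepts exactly the two flags listed and is run from the repository root.

```python
import subprocess
import sys


def build_command(data_dir: str, output_dir: str) -> list[str]:
    # Mirrors the documented invocation; no other flags are assumed.
    return [
        sys.executable,
        "pdf_process.py",
        "--data_dir", data_dir,
        "--output_dir", output_dir,
    ]


def run_conversion(data_dir: str, output_dir: str, scripts_dir: str = "scripts") -> None:
    # cwd plays the role of `cd scripts`; check=True raises if the script fails.
    subprocess.run(build_command(data_dir, output_dir), cwd=scripts_dir, check=True)


cmd = build_command("/path/to/pdf/directory", "/path/to/data/output/directory")
```

Because `cwd` changes the working directory for the child process, relative paths passed as `data_dir` or `output_dir` are resolved against `scripts`; absolute paths avoid that ambiguity.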
Highlighted Details
Maintenance & Community
The project paper was accepted at COLING 2020. Trained models are available in the DocBank Model Zoo.
Licensing & Compatibility
The dataset license has been updated to Apache-2.0. Annotations in MSCOCO format can be downloaded separately. Redistribution of the data is discouraged.
Limitations & Caveats
The dataset focuses primarily on academic papers and may not represent other document types well. Weak supervision is efficient, but the resulting annotations may be less consistent than those in a fully human-labeled dataset.