DocBank by doc-analysis

Layout analysis dataset for document understanding tasks

Created 5 years ago

634 stars

Top 52.2% on SourcePulse

1 Expert Loves This Project

Pawel Garbacki

Cofounder of Fireworks AI

Project Summary

DocBank provides a large-scale, fine-grained dataset for document layout analysis, enabling models to integrate textual and layout information. It targets researchers and practitioners in NLP and computer vision working on document understanding tasks, offering a robust benchmark for evaluating layout analysis models.

How It Works

DocBank uses weak supervision to generate fine-grained, token-level annotations for 12 semantic document units across 500,000 pages. Weak supervision avoids the cost of manual labeling at this scale, which makes the dataset practical for training models that need detailed structural information. The dataset is designed to work with both text-based sequence labeling and image-based object detection approaches.
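The link between token-level labels and the two modeling styles can be illustrated with a minimal sketch. It assumes each token carries its text, a bounding box (x0, y0, x1, y1), and a label; this in-memory layout, the Token class, and the tokens_to_regions function are hypothetical and are not the dataset's documented file format. Consecutive tokens that share a label are merged into one region whose box is the union of the token boxes, which is roughly how sequence labels could be turned into detection targets.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed pixel-like units


@dataclass
class Token:
    # Hypothetical in-memory record; not the dataset's on-disk format.
    text: str
    box: Box
    label: str


def tokens_to_regions(tokens: List[Token]) -> List[Tuple[str, Box]]:
    """Merge runs of consecutive same-label tokens into (label, union box) pairs."""
    regions = []
    for label, run in groupby(tokens, key=lambda t: t.label):
        boxes = [t.box for t in run]
        union = (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
        regions.append((label, union))
    return regions


# Made-up tokens for illustration only.
tokens = [
    Token("Deep", (100, 50, 140, 62), "title"),
    Token("Learning", (145, 50, 210, 62), "title"),
    Token("We", (100, 90, 115, 100), "paragraph"),
    Token("propose", (120, 90, 170, 100), "paragraph"),
]
regions = tokens_to_regions(tokens)
```

A sequence labeling model would consume the tokens list directly, while a detector would train on the merged regions. Merging only consecutive runs keeps two separate paragraphs from collapsing into one box when another unit sits between them.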

Quick Start & Requirements

Data Access: Datasets are available on HuggingFace. Preview samples and index files are in the indexed_files directory.

PDF Processing: A script scripts/pdf_process.py is provided to convert PDF files to the DocBank format.

cd scripts
python pdf_process.py --data_dir /path/to/pdf/directory --output_dir /path/to/data/output/directory

Dependencies: The PDF processing script requires Python and pip. Model training typically involves a deep learning framework (e.g., HuggingFace Transformers, Detectron2) and GPUs; the reported training setup uses 8 V100 GPUs.
Resources: Fine-tuning a model on 400K pages takes approximately 5 hours per epoch.
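
For planning purposes, the epoch time can be turned into a rough throughput figure. The sketch below assumes the 5-hour figure refers to the 8 V100 setup mentioned above, which is not stated explicitly here; the function name is hypothetical.

```python
def pages_per_second(pages: int, hours_per_epoch: float, num_gpus: int = 1) -> float:
    """Average pages processed per second, per GPU, over one epoch."""
    return pages / (hours_per_epoch * 3600) / num_gpus


total = pages_per_second(400_000, 5.0)
per_gpu = pages_per_second(400_000, 5.0, num_gpus=8)  # assumes 8 GPUs share the epoch
print(f"{total:.1f} pages/s overall, {per_gpu:.1f} pages/s per GPU")
# prints: 22.2 pages/s overall, 2.8 pages/s per GPU
```

On different hardware the per-GPU number is only a starting point, since batch size and model choice change it substantially.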

Highlighted Details

500,000 document pages with 12 semantic unit types (e.g., Paragraph, Section, Figure, Table).
Fine-grained token-level annotations, enabling sequence labeling tasks.
Supports both text-based (BERT, RoBERTa, LayoutLM) and image-based (Faster R-CNN with ResNeXt-101) models.
Introduces a new evaluation metric tailored for text-based document layout analysis.
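
The metric itself is not spelled out in this summary. As an illustration only, one way a token-based layout metric can be defined is to weight every token by the area of its bounding box and compute precision, recall, and F1 per label. The definition below is an assumption made for this sketch, with hypothetical function names, and should be checked against the paper before use.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def area(box: Box) -> int:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def area_weighted_prf(
    boxes: List[Box], gold: List[str], pred: List[str], label: str
) -> Tuple[float, float, float]:
    """Per-label precision, recall, and F1, weighting tokens by box area."""
    gold_area = sum(area(b) for b, g in zip(boxes, gold) if g == label)
    pred_area = sum(area(b) for b, p in zip(boxes, pred) if p == label)
    hit_area = sum(
        area(b) for b, g, p in zip(boxes, gold, pred) if g == label and p == label
    )
    precision = hit_area / pred_area if pred_area else 0.0
    recall = hit_area / gold_area if gold_area else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three equally sized tokens; the model mislabels the second one.
boxes = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10)]
gold = ["paragraph", "paragraph", "table"]
pred = ["paragraph", "table", "table"]
p, r, f1 = area_weighted_prf(boxes, gold, pred, "paragraph")
```

Weighting by area means a mislabeled large token costs more than a mislabeled punctuation mark, which is one reason a token-level metric may differ from plain token accuracy.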

Maintenance & Community

The project paper was accepted at COLING 2020. Trained models are available in the DocBank Model Zoo.

Licensing & Compatibility

The dataset license has been updated to Apache-2.0. Annotations in MS COCO format can be downloaded separately. Redistribution of the data is discouraged.

Limitations & Caveats

The dataset is primarily focused on academic papers and may not fully represent all document types. While the weak supervision approach is efficient, the quality of annotations might vary compared to purely human-labeled datasets.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

1 star in the last 30 days