DocBank by doc-analysis

Layout analysis dataset for document understanding tasks

created 5 years ago
622 stars

Project Summary

DocBank is a large-scale dataset with fine-grained annotations for document layout analysis, built so that models can combine textual and layout information. It is aimed at NLP and computer vision researchers and practitioners working on document understanding, and it serves as a benchmark for evaluating layout analysis models.

How It Works

DocBank uses weak supervision to generate fine-grained, token-level annotations for 12 types of semantic units across 500,000 document pages. Because the labels are produced automatically rather than by hand, annotation scales to a size that manual labeling could not reach at comparable cost, which makes the dataset suitable for training models that need detailed structural information. The annotations are designed to work with both text-based sequence labeling and image-based object detection approaches.
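
To illustrate how token-level labels can serve both families of models, the following minimal sketch merges consecutive tokens that share a label into a single region box, which is the kind of target an object detector expects. The `Token` class, the `merge_runs` helper, and the sample coordinates are hypothetical and chosen for illustration; this is not necessarily how DocBank's own detection annotations were produced.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: text, bounding box, semantic label."""
    text: str
    box: Box
    label: str


def merge_runs(tokens: List[Token]) -> List[Tuple[str, Box]]:
    """Merge each run of consecutive same-label tokens into one enclosing box."""
    regions = []
    for label, run in groupby(tokens, key=lambda t: t.label):
        boxes = [t.box for t in run]
        enclosing = (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
        regions.append((label, enclosing))
    return regions


# Made-up sample data, not taken from the dataset.
tokens = [
    Token("Deep", (100, 50, 140, 62), "title"),
    Token("Learning", (145, 50, 210, 62), "title"),
    Token("We", (100, 90, 115, 100), "paragraph"),
    Token("propose", (120, 90, 170, 100), "paragraph"),
]
regions = merge_runs(tokens)
print(regions)
```

The token list itself is already in the shape a sequence labeler consumes (one label per token), so the same records can feed either approach.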

Quick Start & Requirements

  • Data Access: Datasets are available on HuggingFace. Preview samples and index files are in the indexed_files directory.
  • PDF Processing: A script scripts/pdf_process.py is provided to convert PDF files to the DocBank format.
    cd scripts
    python pdf_process.py --data_dir /path/to/pdf/directory --output_dir /path/to/data/output/directory
    
  • Dependencies: The PDF processing script requires Python and pip. Model training typically relies on deep learning frameworks (e.g., HuggingFace Transformers, Detectron2) and GPU resources; the authors report training on 8 V100 GPUs.
  • Resources: Fine-tuning a model on the 400K training pages takes approximately 5 hours per epoch.
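
As a rough sketch of how a file in the DocBank format might be read, the snippet below parses one annotation line. It assumes each line is tab-separated with ten fields: token, four bounding-box coordinates, three RGB color values, font name, and label. That layout is an assumption and should be checked against the sample files in indexed_files; `AnnotatedToken`, `parse_line`, and the sample line are hypothetical.

```python
from typing import NamedTuple, Tuple


class AnnotatedToken(NamedTuple):
    """Hypothetical container for one annotated token."""
    text: str
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1)
    color: Tuple[int, int, int]      # (R, G, B)
    font: str
    label: str


def parse_line(line: str) -> AnnotatedToken:
    """Parse one line, assuming ten tab-separated fields (see lead-in)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 tab-separated fields, got {len(fields)}")
    x0, y0, x1, y1 = (int(v) for v in fields[1:5])
    r, g, b = (int(v) for v in fields[5:8])
    return AnnotatedToken(fields[0], (x0, y0, x1, y1), (r, g, b), fields[8], fields[9])


# Made-up sample line, not copied from the dataset.
sample = "Introduction\t120\t88\t260\t104\t0\t0\t0\tHypotheticalFont-Bold\tsection"
token = parse_line(sample)
print(token.text, token.box, token.label)
```

Raising on an unexpected field count keeps a mismatch between the assumed and the actual layout from passing silently.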

Highlighted Details

  • 500,000 document pages with 12 semantic unit types (e.g., Paragraph, Section, Figure, Table).
  • Fine-grained token-level annotations, enabling sequence labeling tasks.
  • Supports both text-based (BERT, RoBERTa, LayoutLM) and image-based (Faster R-CNN with ResNeXt-101) models.
  • Introduces a new evaluation metric tailored for text-based document layout analysis.
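
The summary does not define the evaluation metric. One plausible reading, stated here as an assumption, is an area-weighted precision and recall computed over token boxes rather than over token counts or region IoU. The sketch below implements that reading for a single label; `box_area`, `area_prf`, and the sample data are hypothetical, and the paper remains the authoritative definition.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def box_area(box: Box) -> int:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def area_prf(boxes: List[Box], gold: List[str], pred: List[str],
             label: str) -> Tuple[float, float, float]:
    """Area-weighted precision, recall and F1 for one label (assumed definition)."""
    rows = list(zip(boxes, gold, pred))
    hit = sum(box_area(b) for b, g, p in rows if g == label and p == label)
    pred_area = sum(box_area(b) for b, _, p in rows if p == label)
    gold_area = sum(box_area(b) for b, g, _ in rows if g == label)
    precision = hit / pred_area if pred_area else 0.0
    recall = hit / gold_area if gold_area else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: three tokens with areas 100, 200 and 100.
boxes = [(0, 0, 10, 10), (10, 0, 30, 10), (0, 20, 10, 30)]
gold = ["title", "title", "paragraph"]
pred = ["title", "paragraph", "paragraph"]
precision, recall, f1 = area_prf(boxes, gold, pred, "title")
print(precision, recall, f1)
```

Weighting by area means a mislabeled wide token costs more than a mislabeled narrow one, which a plain token-count metric would not capture.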

Maintenance & Community

The project paper was accepted at COLING 2020. Trained models are available in the DocBank Model Zoo.

Licensing & Compatibility

The dataset license has been updated to Apache-2.0. Annotations in MS COCO format can be downloaded separately. Redistribution of the data is discouraged.

Limitations & Caveats

The dataset consists primarily of academic papers and may not represent other document types well. Weak supervision makes annotation efficient, but the resulting labels may be noisier or less consistent than those in fully human-labeled datasets.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
