docformer by shabie

PyTorch implementation for visual document understanding research

created 3 years ago
281 stars

Top 93.7% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of DocFormer, a multi-modal transformer architecture designed for Visual Document Understanding (VDU). It addresses the challenge of integrating text, vision, and spatial information for tasks like document layout analysis and information extraction, targeting researchers and engineers in document AI.

How It Works

DocFormer leverages a novel multi-modal self-attention mechanism to fuse text, vision, and spatial features. It utilizes unsupervised pre-training with carefully designed tasks to encourage multi-modal interaction. A key advantage is the sharing of learned spatial embeddings across modalities, enabling effective correlation between text and visual tokens.
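The fusion described above can be sketched in PyTorch. This is a hedged, illustrative toy (class name, shapes, and the simple additive fusion are assumptions for exposition, not the repository's actual API); the key point it demonstrates is a single spatial projection shared by the text and visual streams, so tokens at the same location receive the same positional signal.

```python
import torch
import torch.nn as nn

class MultiModalSelfAttention(nn.Module):
    """Illustrative sketch of multi-modal self-attention with a shared
    spatial (bounding-box) embedding. Names and shapes are hypothetical,
    not taken from the docformer codebase."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One spatial projection shared by BOTH modalities: text and visual
        # tokens with the same bounding box get the same spatial embedding.
        self.spatial_proj = nn.Linear(4, d_model)

    def forward(self, text_feats, vis_feats, bboxes):
        # text_feats, vis_feats: (batch, seq, d_model)
        # bboxes: (batch, seq, 4) normalized box coordinates per token
        spatial = self.spatial_proj(bboxes)      # shared spatial embedding
        t = text_feats + spatial                 # inject layout into text
        v = vis_feats + spatial                  # inject layout into vision
        t_out, _ = self.text_attn(t, t, t)
        v_out, _ = self.vis_attn(v, v, v)
        return t_out + v_out                     # fused multi-modal output

# Usage: fuse 16 text tokens and 16 visual tokens per document
layer = MultiModalSelfAttention()
text = torch.randn(2, 16, 64)
vis = torch.randn(2, 16, 64)
boxes = torch.rand(2, 16, 4)
out = layer(text, vis, boxes)
print(out.shape)  # torch.Size([2, 16, 64])
```

Sharing `spatial_proj` (rather than learning separate spatial embeddings per modality) is what lets attention correlate a text token with the visual patch at the same document location.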

Quick Start & Requirements

  • Clone the repository: git clone https://github.com/shabie/docformer.git
  • Install the OCR system dependency: sudo apt install tesseract-ocr
  • Install the Python OCR wrapper: pip install pytesseract
  • Additional dependencies include transformers and torch.
  • Usage requires adding docformer/src/docformer/ to sys.path.
  • See the Kaggle notebook for an example of fine-tuning on FUNSD.
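Before OCR output can feed a layout-aware model, pixel bounding boxes are typically rescaled to a fixed range. The helper below assumes the 0–1000 convention used by LayoutLM-style preprocessing; whether docformer uses exactly this range is an assumption, but the rescaling step itself is standard.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel bounding box (x0, y0, x1, y1) to the 0-1000 range
    commonly used by layout transformers. The 0-1000 convention is
    borrowed from LayoutLM-style pipelines (assumption, not confirmed
    by this repository)."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# Usage: a word box from pytesseract on a 960x1200 page scan
result = normalize_bbox((120, 60, 480, 90), page_width=960, page_height=1200)
print(result)  # → (125, 50, 500, 75)
```

In practice the raw boxes would come from pytesseract's word-level OCR output on the document image, with one normalized box per text token.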

Highlighted Details

  • Achieves state-of-the-art results on 4 VDU datasets, outperforming models up to 4x larger.
  • Employs unsupervised pre-training to enhance multi-modal feature interaction.
  • Integrates text, vision, and spatial features using a novel multi-modal self-attention layer.
  • Shares spatial embeddings across modalities for improved token correlation.

Maintenance & Community

  • Maintained by uakarsh and shabie.
  • The project cites the original ICCV 2021 paper "DocFormer: End-to-End Transformer for Document Understanding".

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided implementation is an unofficial reproduction and may have issues, particularly with pytesseract imports. Pre-trained weights are available only for masked language modeling (MLM) on a subset of the IDL dataset.

Health Check
Last commit

2 years ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
7 stars in the last 90 days

