PyTorch implementation for visual document understanding research
This repository provides a PyTorch implementation of DocFormer, a multi-modal transformer architecture designed for Visual Document Understanding (VDU). It addresses the challenge of integrating text, vision, and spatial information for tasks like document layout analysis and information extraction, targeting researchers and engineers in document AI.
How It Works
DocFormer leverages a novel multi-modal self-attention mechanism to fuse text, vision, and spatial features. It utilizes unsupervised pre-training with carefully designed tasks to encourage multi-modal interaction. A key advantage is the sharing of learned spatial embeddings across modalities, enabling effective correlation between text and visual tokens.
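As a rough illustration of the shared-spatial-embedding idea, the same learned bounding-box embedding can be added to both the text and the visual token features before self-attention. This is a minimal sketch with invented class and parameter names, not the repository's actual module:

```python
import torch
import torch.nn as nn

class SharedSpatialFusion(nn.Module):
    """Toy sketch: one spatial embedding table shared by text and vision branches."""
    def __init__(self, hidden=768, max_pos=1024):
        super().__init__()
        # Shared across modalities: coordinate embeddings for box corners
        self.x_emb = nn.Embedding(max_pos, hidden)
        self.y_emb = nn.Embedding(max_pos, hidden)

    def forward(self, text_feats, vis_feats, boxes):
        # text_feats, vis_feats: (batch, seq, hidden), aligned token-for-token
        # boxes: (batch, seq, 4) integer coords (x0, y0, x1, y1) in [0, max_pos)
        spatial = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1]) +
                   self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        # The *same* spatial signal conditions both modalities, letting
        # attention correlate a text token with the visual region at its location.
        return text_feats + spatial, vis_feats + spatial
```

Because both modalities receive an identical spatial signal, attention heads can match text tokens to visual features by position; the full DocFormer architecture injects spatial information at every layer, which the sketch above omits.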
Quick Start & Requirements
Install the OCR dependencies and clone the repository:

```bash
pip install pytesseract
sudo apt install tesseract-ocr
git clone https://github.com/shabie/docformer.git
```

Then add `docformer/src/docformer/` to `sys.path`; the code also requires `transformers` and `torch`.
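A minimal setup following those steps might look like the sketch below. The `pytesseract.image_to_data` call is the library's standard API for extracting words with bounding boxes, which is the kind of spatial input DocFormer consumes; the file name `sample_page.png` is a placeholder, and the actual model entry points live in the cloned package, so check its source for the real class names:

```python
import sys
sys.path.append("docformer/src/docformer")  # path from the git clone above

import pytesseract
from PIL import Image

# OCR a page: words plus pixel-space bounding boxes (left, top, width, height)
img = Image.open("sample_page.png")
ocr = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

words = [txt for txt in ocr["text"] if txt.strip()]
boxes = [
    (x, y, x + w, y + h)
    for txt, x, y, w, h in zip(ocr["text"], ocr["left"], ocr["top"],
                               ocr["width"], ocr["height"])
    if txt.strip()
]

# From here, tokens and boxes would be fed to the DocFormer model;
# consult the modeling code in the cloned package for the actual API.
```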
Highlighted Details
- Multi-modal self-attention fusing text, vision, and spatial features
- Learned spatial embeddings shared across modalities
- Pre-trained MLM weights on a subset of the IDL Dataset
Maintenance & Community
The last commit was roughly two years ago and the project appears inactive.
Licensing & Compatibility
See the repository's LICENSE file for terms; the code targets PyTorch, with `torch` and `transformers` as core dependencies.
Limitations & Caveats
The provided implementation is an unofficial reproduction and may have issues, particularly with the `pytesseract` imports. Pre-trained weights are available for MLM on a subset of the IDL Dataset.