shabie/docformer: PyTorch implementation for visual document understanding research
Top 91.3% on SourcePulse
This repository provides a PyTorch implementation of DocFormer, a multi-modal transformer architecture designed for Visual Document Understanding (VDU). It addresses the challenge of integrating text, vision, and spatial information for tasks like document layout analysis and information extraction, targeting researchers and engineers in document AI.
How It Works
DocFormer leverages a novel multi-modal self-attention mechanism to fuse text, vision, and spatial features. It utilizes unsupervised pre-training with carefully designed tasks to encourage multi-modal interaction. A key advantage is the sharing of learned spatial embeddings across modalities, enabling effective correlation between text and visual tokens.
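To make the shared-spatial-embedding idea concrete, here is a minimal NumPy sketch of attention in which the same spatial embeddings are added to both the text and vision streams before computing attention scores. This is a hypothetical simplification for illustration, not the repository's actual code; the function name and fusion-by-summation are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(text, vision, spatial, d_k):
    """Toy multi-modal self-attention (hypothetical simplification).

    The *same* spatial embeddings are added to both modalities, so the
    text and vision attention scores are computed in a common spatial
    frame -- the key idea behind DocFormer's shared spatial embeddings.
    """
    q_t = k_t = text + spatial      # text stream, spatially grounded
    q_v = k_v = vision + spatial    # vision stream, same spatial grounding
    attn_t = softmax(q_t @ k_t.T / np.sqrt(d_k))
    attn_v = softmax(q_v @ k_v.T / np.sqrt(d_k))
    # Fuse the two modality streams by summing their attended outputs.
    return attn_t @ text + attn_v @ vision
```

In the real architecture the spatial features are learned embeddings of bounding-box coordinates and the fusion is more elaborate, but the sketch shows why sharing them lets text and visual tokens at the same location attend to each other coherently.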
Quick Start & Requirements
pip install pytesseract
sudo apt install tesseract-ocr
git clone https://github.com/shabie/docformer.git

Add docformer/src/docformer/ to sys.path. The implementation also requires transformers and torch.

Highlighted Details
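Since the project is not packaged for pip, the cloned source directory has to be put on the import path by hand. A minimal sketch, assuming the repository was cloned into the current working directory:

```python
import sys
from pathlib import Path

# Assumes `git clone https://github.com/shabie/docformer.git` was run
# in the current directory; adjust the path if you cloned elsewhere.
repo_src = Path("docformer/src/docformer").resolve()
sys.path.insert(0, str(repo_src))

# Modules inside the repo can now be imported directly; the exact
# module names depend on the repository's layout.
```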
Maintenance & Community
The repository was last updated about 2 years ago and is currently inactive.

Licensing & Compatibility

Limitations & Caveats
The provided implementation is an unofficial reproduction and may have issues, particularly with pytesseract imports. Pre-trained weights are available only for MLM pre-training on a subset of the IDL Dataset.