docformer by shabie

PyTorch implementation for visual document understanding research

Created 4 years ago · 284 stars · Top 92.1% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of DocFormer, a multi-modal transformer architecture designed for Visual Document Understanding (VDU). It addresses the challenge of integrating text, vision, and spatial information for tasks like document layout analysis and information extraction, targeting researchers and engineers in document AI.

How It Works

DocFormer leverages a novel multi-modal self-attention mechanism to fuse text, vision, and spatial features. It utilizes unsupervised pre-training with carefully designed tasks to encourage multi-modal interaction. A key advantage is the sharing of learned spatial embeddings across modalities, enabling effective correlation between text and visual tokens.
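
The shared spatial embedding is the detail most worth making concrete. Below is a minimal PyTorch sketch, not the repository's code: the class name, the bucketed positions input, and all shapes are illustrative assumptions, chosen only to show both modalities adding the same learned spatial embedding before self-attention.

```python
# Illustrative sketch (NOT the repository's implementation) of DocFormer's
# core idea: text and visual token streams each run self-attention, but both
# add the SAME learned spatial embedding, so tokens at the same page
# location become correlated across modalities.
import torch
import torch.nn as nn

class SharedSpatialAttention(nn.Module):
    def __init__(self, dim=768, heads=12, num_pos_buckets=512):
        super().__init__()
        # One spatial embedding table shared by both modalities.
        self.spatial_emb = nn.Embedding(num_pos_buckets, dim)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, vis_feats, positions):
        # positions: (batch, seq) integer bucket for each token's page
        # location, e.g. a quantized bounding-box coordinate (simplified here).
        spatial = self.spatial_emb(positions)   # (B, S, D)
        t = text_feats + spatial                # inject layout into text tokens
        v = vis_feats + spatial                 # the SAME embedding for vision
        t_out, _ = self.text_attn(t, t, t)
        v_out, _ = self.vis_attn(v, v, v)
        return t_out, v_out

# Smoke test with random features.
B, S, D = 2, 16, 768
layer = SharedSpatialAttention(dim=D)
text = torch.randn(B, S, D)
vis = torch.randn(B, S, D)
pos = torch.randint(0, 512, (B, S))
t_out, v_out = layer(text, vis, pos)
print(t_out.shape, v_out.shape)  # torch.Size([2, 16, 768]) twice
```

Because both streams index one embedding table, a text token and a visual token at the same page location receive identical spatial offsets, which is what lets attention correlate them across modalities.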

Quick Start & Requirements

  • Install via pip: pip install pytesseract
  • System dependency: sudo apt install tesseract-ocr
  • Clone the repository: git clone https://github.com/shabie/docformer.git
  • Usage requires adding docformer/src/docformer/ to sys.path (see the sketch after this list).
  • Dependencies include transformers and torch.
  • See the Kaggle notebook for fine-tuning on FUNSD.
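
A minimal setup sketch, assuming the repository was cloned into the current directory and that you import from the source tree directly. The commented import is hypothetical; take the real module and class names from the repository source or the FUNSD notebook.

```python
import sys

# Path from the README's instruction; adjust if you cloned elsewhere.
sys.path.append("docformer/src/docformer/")

# Hypothetical import -- the actual module/class names may differ; see the
# repository and the Kaggle FUNSD fine-tuning notebook for the real API.
# from modeling import DocFormer
```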

Highlighted Details

  • Achieves state-of-the-art results on 4 VDU datasets, outperforming models up to 4x larger.
  • Employs unsupervised pre-training to enhance multi-modal feature interaction.
  • Integrates text, vision, and spatial features using a novel multi-modal self-attention layer.
  • Shares spatial embeddings across modalities for improved token correlation.

Maintenance & Community

  • Maintained by uakarsh and shabie.
  • The project cites the original ICCV 2021 paper "DocFormer: End-to-End Transformer for Document Understanding".

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The implementation is an unofficial reproduction and may have rough edges, notably with pytesseract imports. Pre-trained weights are available only for masked-language-modeling (MLM) pre-training on a subset of the IDL dataset.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars · Created 3 years ago · Updated 2 years ago
Multimodal framework for vision-and-language transformer research
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars · Created 2 years ago · Updated 1 year ago
Multimodal model for grounding language models to images
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars · Created 2 years ago · Updated 1 year ago
Multimodal LLM for generating/retrieving images and generating text
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars · Created 2 years ago · Updated 1 year ago
Open-source framework for training large multimodal models
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

6k stars · Created 4 years ago · Updated 5 days ago
Transformer library with extensive experimental features