docformer by shabie

PyTorch implementation for visual document understanding research

Created 4 years ago · 284 stars · Top 92.1% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of DocFormer, a multi-modal transformer architecture designed for Visual Document Understanding (VDU). It addresses the challenge of integrating text, vision, and spatial information for tasks like document layout analysis and information extraction, targeting researchers and engineers in document AI.

How It Works

DocFormer leverages a novel multi-modal self-attention mechanism to fuse text, vision, and spatial features. It utilizes unsupervised pre-training with carefully designed tasks to encourage multi-modal interaction. A key advantage is the sharing of learned spatial embeddings across modalities, enabling effective correlation between text and visual tokens.
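
The shared spatial embedding is the detail most worth making concrete. Below is a minimal PyTorch sketch, not the repository's code: the class name, the bucketed positions input, and all shapes are illustrative assumptions, chosen only to show both modalities adding the same learned spatial embedding before self-attention.

```python
# Illustrative sketch (NOT the repository's implementation) of DocFormer's
# core idea: text and visual token streams each run self-attention, but both
# add the SAME learned spatial embedding, so tokens at the same page
# location become correlated across modalities.
import torch
import torch.nn as nn

class SharedSpatialAttention(nn.Module):
    def __init__(self, dim=768, heads=12, num_pos_buckets=512):
        super().__init__()
        # One spatial embedding table shared by both modalities.
        self.spatial_emb = nn.Embedding(num_pos_buckets, dim)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, vis_feats, positions):
        # positions: (batch, seq) integer bucket for each token's page
        # location, e.g. a quantized bounding-box coordinate (simplified here).
        spatial = self.spatial_emb(positions)   # (B, S, D)
        t = text_feats + spatial                # inject layout into text tokens
        v = vis_feats + spatial                 # the SAME embedding for vision
        t_out, _ = self.text_attn(t, t, t)
        v_out, _ = self.vis_attn(v, v, v)
        return t_out, v_out

# Smoke test with random features.
B, S, D = 2, 16, 768
layer = SharedSpatialAttention(dim=D)
text = torch.randn(B, S, D)
vis = torch.randn(B, S, D)
pos = torch.randint(0, 512, (B, S))
t_out, v_out = layer(text, vis, pos)
print(t_out.shape, v_out.shape)  # torch.Size([2, 16, 768]) twice
```

Because both streams index one embedding table, a text token and a visual token at the same page location receive identical spatial offsets, which is what lets attention correlate them across modalities.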

Quick Start & Requirements

  • Install via pip: pip install pytesseract
  • System dependency: sudo apt install tesseract-ocr
  • Clone the repository: git clone https://github.com/shabie/docformer.git
  • Usage requires adding docformer/src/docformer/ to sys.path (see the sketch after this list).
  • Dependencies include transformers and torch.
  • See the Kaggle notebook for fine-tuning on FUNSD.
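
A minimal setup sketch, assuming the repository was cloned into the current directory and that you import from the source tree directly. The commented import is hypothetical; take the real module and class names from the repository source or the FUNSD notebook.

```python
import sys

# Path from the README's instruction; adjust if you cloned elsewhere.
sys.path.append("docformer/src/docformer/")

# Hypothetical import -- the actual module/class names may differ; see the
# repository and the Kaggle FUNSD fine-tuning notebook for the real API.
# from modeling import DocFormer
```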

Highlighted Details

  • Achieves state-of-the-art results on 4 VDU datasets, outperforming models up to 4x larger.
  • Employs unsupervised pre-training to enhance multi-modal feature interaction.
  • Integrates text, vision, and spatial features using a novel multi-modal self-attention layer.
  • Shares spatial embeddings across modalities for improved token correlation.

Maintenance & Community

  • Maintained by uakarsh and shabie.
  • The project cites the original ICCV 2021 paper "DocFormer: End-to-End Transformer for Document Understanding".

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The implementation is an unofficial reproduction and may have rough edges, notably with pytesseract imports. Pre-trained weights are available only for masked-language-modeling (MLM) pre-training on a subset of the IDL dataset.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars · Created 3 years ago · Updated 2 years ago
Multimodal framework for vision-and-language transformer research
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars · Created 2 years ago · Updated 1 year ago
Multimodal model for grounding language models to images
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars · Created 2 years ago · Updated 1 year ago
Multimodal LLM for generating/retrieving images and generating text
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars · Created 2 years ago · Updated 1 year ago
Open-source framework for training large multimodal models
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

6k stars · Created 4 years ago · Updated 5 days ago
Transformer library with extensive experimental features