deepdoctection by deepdoctection

Document AI pipeline for RAG and model training

created 3 years ago
2,919 stars

Top 16.3% on SourcePulse

Project Summary

This library orchestrates document layout analysis and extraction for RAG. It targets researchers and developers building Document AI pipelines and offers a unified framework for training models, evaluating them, and running inference, which simplifies complex document understanding tasks.

How It Works

deepdoctection integrates multiple state-of-the-art models for layout analysis, OCR, and document classification. For core vision tasks it uses either PyTorch with Detectron2 and Transformers, or TensorFlow with Tensorpack. For OCR, it supports Tesseract, DocTr, and AWS Textract. Document and token classification are handled by LayoutLM family models, LiLT, and BERT-style architectures, with features such as sliding windows for long inputs. Additional utilities include text mining for PDFs, language detection, and image deskewing.
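
The sliding-window idea mentioned above can be illustrated with a minimal sketch. This is a generic version of the technique, not deepdoctection's own implementation: the function name sliding_windows and the parameters max_len and overlap are assumptions chosen for illustration. It splits a long token sequence into overlapping chunks so that each chunk fits a model's maximum input length.

```python
from typing import List


def sliding_windows(tokens: List[str], max_len: int = 512, overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that context near a
    window boundary is visible in both neighbouring windows.
    Hypothetical helper for illustration; not part of deepdoctection.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    # Short inputs need no splitting.
    if len(tokens) <= max_len:
        return [list(tokens)]

    step = max_len - overlap  # how far each window start advances
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows
```

A classifier would then be run on each window, and the per-token predictions in overlapping regions merged by some rule (for example, keeping the prediction from the window where the token sits furthest from the edge). The merge rule is left out here because the source does not describe it.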

Quick Start & Requirements

  • Install via pip: pip install deepdoctection. Add the [pt] or [tf] extra (for example, pip install deepdoctection[pt]) for the full PyTorch or TensorFlow feature set.
  • Requires Python >= 3.9.
  • Requires either PyTorch (>= 2.2) or TensorFlow (>= 2.11, < 2.16). TensorFlow support is deprecated for Python 3.11 and later.
  • GPU recommended for fine-tuning.
  • Linux or macOS; Windows is not supported but a Dockerfile is available.
  • See the introduction notebook and the Hugging Face Space demo.
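
After installation, a first run might look like the sketch below. It assumes the entry points shown in the project's README (dd.get_dd_analyzer(), analyze(path=...), reset_state(), and a text attribute on each page); check the introduction notebook for the version you install, since the API may differ between releases. The helper name collect_page_texts is hypothetical and exists only for this example.

```python
from typing import Any, List


def collect_page_texts(analyzer: Any, path: str) -> List[str]:
    """Run an analyzer over a document and return the text of each page.

    `analyzer` is assumed to behave like the object returned by
    deepdoctection's get_dd_analyzer(); this helper itself is hypothetical.
    """
    df = analyzer.analyze(path=path)  # sets up the pipeline; pages are produced on iteration
    df.reset_state()                  # initialization step that must precede iteration
    return [page.text for page in df]


# With the library installed, usage would look roughly like this
# (the path is a placeholder):
#
#   import deepdoctection as dd
#   texts = collect_page_texts(dd.get_dd_analyzer(), "/path/to/your/doc.pdf")
```

Taking the analyzer as a parameter keeps the helper independent of which backend or model configuration is installed.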

Highlighted Details

  • Supports Detectron2, Transformers, Tensorpack, Tesseract, DocTr, AWS Textract, LayoutLM, LiLT.
  • Offers fine-tuning and evaluation tools.
  • Includes tutorials and a Hugging Face Space demo.
  • Handles native PDFs via pdfplumber.

Maintenance & Community

  • Actively developed; v0.43 is a recent release.
  • Community interest is reflected mainly in GitHub stars and recommendations.

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and closed-source linking.

Limitations & Caveats

Windows is not officially supported, though a Docker solution exists. TensorFlow support is being phased out for newer Python versions.
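
The version constraints listed under Quick Start & Requirements can be checked before installing. The sketch below encodes only the bounds stated in this summary. The function name check_environment, the backend keys "pt" and "tf", and the "deprecated" status for TensorFlow on Python 3.11 and later are assumptions made for illustration, not behaviour of the library.

```python
from typing import Tuple

Version = Tuple[int, int]  # (major, minor)


def check_environment(python: Version, backend: str, backend_version: Version) -> str:
    """Return "ok", "deprecated", or "unsupported" for a planned setup.

    Encodes the bounds from this summary: Python >= 3.9, PyTorch >= 2.2,
    TensorFlow >= 2.11 and < 2.16. Hypothetical helper for illustration.
    """
    if python < (3, 9):
        return "unsupported"
    if backend == "pt":
        return "ok" if backend_version >= (2, 2) else "unsupported"
    if backend == "tf":
        if not ((2, 11) <= backend_version < (2, 16)):
            return "unsupported"
        # Assumption: TensorFlow on Python 3.11+ is treated as deprecated.
        return "deprecated" if python >= (3, 11) else "ok"
    raise ValueError(f"unknown backend: {backend!r}")
```

For example, check_environment((3, 11), "tf", (2, 15)) falls inside the TensorFlow version range but is flagged as deprecated, which suggests choosing the PyTorch backend for new projects.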

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 1
  • Star history: 40 stars in the last 30 days
