deepdoctection by deepdoctection

Document AI pipeline for RAG and model training

Created 4 years ago

3,117 stars

Top 15.2% on SourcePulse

3 Experts Love This Project

jerryjliu

Cofounder of LlamaIndex

simonw

Coauthor of Django

guygurari

Cofounder of Augment

Project Summary

This library orchestrates document layout analysis and extraction for RAG. It targets researchers and developers building Document AI pipelines, and offers a unified framework for training models, evaluating them, and running inference, which simplifies complex document understanding tasks.

How It Works

deepdoctection integrates multiple state-of-the-art models for layout analysis, OCR, and document classification. For core vision tasks it uses either PyTorch (with Detectron2 and Transformers) or TensorFlow (with Tensorpack). OCR is available through Tesseract, docTR, and AWS Textract. Document and token classification are handled by LayoutLM-family models, LiLT, and BERT-style architectures, with features such as sliding windows. Additional utilities include text mining for PDFs, language detection, and image deskewing.
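Sliding windows are a common way to fit a long token sequence into a classifier with a fixed maximum input length. The following is a minimal sketch of the general technique in plain Python, not deepdoctection's own implementation; the function name sliding_windows and the parameters max_len and stride are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(tokens: Sequence[T], max_len: int, stride: int) -> List[List[T]]:
    """Split tokens into overlapping windows of at most max_len items.

    Consecutive windows start `stride` positions apart, so neighbours
    share max_len - stride tokens of context.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if stride > max_len:
        # A stride larger than the window would skip tokens entirely.
        raise ValueError("stride must not exceed max_len")
    if len(tokens) == 0:
        return []

    windows: List[List[T]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


if __name__ == "__main__":
    for window in sliding_windows(list(range(10)), max_len=4, stride=2):
        print(window)
```

Each window would then be classified separately. How predictions for the overlapping tokens are merged (for example, keeping the first prediction or averaging scores) is a design choice that this sketch leaves open.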

Quick Start & Requirements

Install via pip: pip install deepdoctection (add the [pt] or [tf] extra for full features).
Requires Python >= 3.9.
Either PyTorch (>= 2.2) or TensorFlow (>= 2.11, < 2.16) is required. TensorFlow support is deprecated from Python 3.11 onward.
GPU recommended for fine-tuning.
Linux or macOS; Windows is not supported but a Dockerfile is available.
See the introduction notebook and the Hugging Face Space demo.
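The version bounds above can be checked before installing. The sketch below encodes only the constraints stated in this listing; the helper names parse_major_minor and pick_backend are hypothetical, and preferring PyTorch when both frameworks qualify is an assumption, motivated by the TensorFlow deprecation note.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse_major_minor(version: str) -> Version:
    """Extract (major, minor) from a version string such as '2.2.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def pick_backend(
    python: Version,
    torch: Optional[str],
    tensorflow: Optional[str],
) -> Optional[str]:
    """Return 'pt', 'tf', or None according to the bounds in this listing.

    Assumption: PyTorch is preferred when both frameworks qualify.
    """
    if python < (3, 9):
        return None  # Python >= 3.9 is required
    if torch is not None and parse_major_minor(torch) >= (2, 2):
        return "pt"  # PyTorch >= 2.2
    if tensorflow is not None and (2, 11) <= parse_major_minor(tensorflow) < (2, 16):
        return "tf"  # TensorFlow >= 2.11, < 2.16
    return None


if __name__ == "__main__":
    import sys

    # Pass the installed framework versions here, or None if not installed.
    print(pick_backend(sys.version_info[:2], torch="2.2.1", tensorflow=None))
```

The returned string matches the extras named in the install line, so a result of pt would suggest pip install deepdoctection[pt]. The check does not model the Python 3.11 deprecation of TensorFlow, which the listing states without giving an exact cutoff behaviour.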

Highlighted Details

Supports Detectron2, Transformers, Tensorpack, Tesseract, docTR, AWS Textract, LayoutLM, and LiLT.
Offers fine-tuning and evaluation tools.
Includes tutorials and a Hugging Face Space demo.
Handles native PDFs via pdfplumber.

Maintenance & Community

Actively developed, with a recent v0.43 release.
Community interest shows in GitHub stars and expert recommendations.

Licensing & Compatibility

Apache 2.0 License.
Permissive license suitable for commercial use and closed-source linking.

Limitations & Caveats

Windows is not officially supported, though a Docker solution exists. TensorFlow support is being phased out for newer Python versions.

Health Check

Last Commit: 3 days ago

Responsiveness: Inactive

Pull Requests (30d): 5

Issues (30d): 3

Star History: 17 stars in the last 30 days

Explore Similar Projects

BetterOCR by junhoyeo

OCR tool combining multiple engines with LLM for improved text detection

Created 2 years ago

Updated 7 months ago

Versatile-OCR-Program by ses4255

OCR pipeline for ML training datasets from documents

Created 9 months ago

Updated 7 months ago

DeepSeek-OCR-WebUI by neosun100

Intelligent OCR web application for diverse document and image analysis

Created 2 months ago

Updated 3 weeks ago

awesome-ocr-resources by ZumingHuang

OCR resource collection (papers, datasets, APIs)

Created 7 years ago

Updated 1 year ago

OpenOCR by Topdu

General OCR toolkit for research and applications

Created 1 year ago

Updated 4 days ago

paperless-gpt by icereed

AI tool for paperless-ngx document management

Created 1 year ago

Updated 1 day ago

HunyuanOCR by Tencent-Hunyuan

Advanced OCR and document understanding via lightweight VLM

Created 1 month ago

Updated 1 week ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera).

AdvancedLiterateMachinery by AlibabaResearch

Collection of algorithms for Advanced Literate Machinery research

Created 3 years ago

Updated 9 months ago

PolyglotPDF by CBIhalsen

PDF tool for layout-preserving translation

Created 1 year ago

Updated 3 months ago

awesome-ocr by wanghaisheng

Curated list of OCR resources

Created 9 years ago

Updated 3 years ago

awesome-deep-text-detection-recognition by hwalsuklee

Curated list of deep learning papers for text detection/recognition

Created 8 years ago

Updated 4 years ago

Starred by Pawel Garbacki (Cofounder of Fireworks AI).

GOT-OCR2.0 by Ucas-HaoranWei

OCR research paper for unified end-to-end model

Created 1 year ago

Updated 11 months ago
