Document query engine for extracting information from documents
DocQuery is a library and CLI tool for extracting information from documents (PDFs, images) using LLMs, targeting users who need to query and analyze document content. It simplifies asking questions of documents, enabling automated data extraction for use cases like invoice processing and contract analysis.
How It Works
DocQuery leverages a pre-trained LayoutLM model fine-tuned for question answering on documents. This approach combines visual understanding (LayoutLM) with natural language processing, making it adept at visual question answering (VQA) tasks on semi-structured and unstructured documents. The model is trained on SQuAD2.0 and DocVQA datasets, allowing it to answer questions based on document content and layout.
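The question-answering flow described above can also be driven from Python. The following is a minimal sketch, not the project's definitive API: the names `pipeline`, `document.load_document`, and `doc.context` are assumptions about the installed docquery version and should be checked against it, while `ask_all` and `query_document` are hypothetical helpers introduced here for illustration.

```python
from typing import Any, Callable, Dict, Iterable


def ask_all(
    qa: Callable[..., Any],
    context: Dict[str, Any],
    questions: Iterable[str],
) -> Dict[str, Any]:
    """Run every question against one document and collect the raw answers.

    `qa` is any callable that accepts `question=` plus the document context
    as keyword arguments (hypothetical helper, not part of docquery).
    """
    return {q: qa(question=q, **context) for q in questions}


def query_document(path: str, questions: Iterable[str]) -> Dict[str, Any]:
    """Load a PDF or image and answer questions about it."""
    # Imported lazily so ask_all() stays usable without the package installed.
    # Assumption: these names exist in the installed docquery release.
    from docquery import document, pipeline

    qa = pipeline("document-question-answering")
    doc = document.load_document(path)
    return ask_all(qa, doc.context, questions)
```

Calling `query_document("/path/to/document", ["What is the invoice number?"])` would download the pre-trained model on first use, so expect the first call to be slow.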
Quick Start & Requirements
Install: pip install docquery
Prerequisite: tesseract (install via brew install tesseract on macOS or apt install tesseract-ocr on Ubuntu).
Usage: docquery scan "Your question?" /path/to/document
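For batch jobs it can be convenient to call the CLI from a script. The sketch below wraps the `docquery scan` invocation shown above; it assumes the `docquery` and `tesseract` executables are on `PATH`, and `build_scan_command`, `missing_prerequisites`, and `run_scan` are hypothetical helper names, not part of DocQuery.

```python
import shutil
import subprocess
from typing import List


def build_scan_command(question: str, document_path: str) -> List[str]:
    """Build the argv list for: docquery scan "<question>" <document_path>."""
    # Passing a list (not a shell string) avoids quoting problems in the question.
    return ["docquery", "scan", question, document_path]


def missing_prerequisites() -> List[str]:
    """Return the names of required executables that are not on PATH."""
    return [name for name in ("docquery", "tesseract") if shutil.which(name) is None]


def run_scan(question: str, document_path: str) -> str:
    """Run the scan and return whatever the CLI prints to stdout."""
    missing = missing_prerequisites()
    if missing:
        raise RuntimeError("missing prerequisites: " + ", ".join(missing))
    result = subprocess.run(
        build_scan_command(question, document_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

The output format of the CLI is not specified here, so `run_scan` returns the raw text and leaves any parsing to the caller.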
Highlighted Details
--classify flag.
[donut] extra for using the Donut model for classification and VQA.
Maintenance & Community
Licensing & Compatibility
transformers library (Apache 2.0 licensed). Generally compatible with commercial use, but users should be aware of the dual licensing implications for adapted code.
Limitations & Caveats
The project is not actively maintained. It relies on pre-trained models and does not support training on custom data. Answers are returned as plain text only: structured data such as tables is not extracted, and numbers and dates come back as strings rather than typed values. Only images and PDFs are supported as input.
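Because numbers and dates come back as strings, callers who need typed values have to convert them after the fact. The sketch below is one possible post-processing step and is not part of DocQuery: `parse_amount` and `parse_date` are hypothetical helpers, and they assume dot-decimal amounts with comma thousands separators and a small, caller-supplied set of date formats.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Sequence


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert an answer such as "$1,234.50" to a Decimal, or None on failure."""
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_date(
    text: str,
    formats: Sequence[str] = ("%Y-%m-%d", "%m/%d/%Y"),
) -> Optional[date]:
    """Try each format in order; return the first successful parse, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising keeps a batch run going when the model's answer is not in the expected shape; callers can then flag those documents for manual review.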