docquery by impira

Document query engine for extracting information from documents

Created 3 years ago

1,783 stars

Top 23.9% on SourcePulse

3 Experts Love This Project

Luca Soldaini

Research Scientist at Ai2

Jeff Hammerbacher

Cofounder of Cloudera

Quinn Slack

Cofounder of Sourcegraph

Project Summary

DocQuery is a Python library and command-line tool that answers natural-language questions about documents (PDFs and images) using LLMs. It targets users who need to automate data extraction from document content, for use cases such as invoice processing and contract analysis.

How It Works

DocQuery uses a pre-trained LayoutLM model fine-tuned for document question answering. LayoutLM encodes each word together with its position on the page, so the model can draw on layout as well as text when answering, which suits visual question answering (VQA) on semi-structured and unstructured documents. The model is fine-tuned on the SQuAD2.0 and DocVQA datasets.
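The question-answering flow described above can be sketched in Python. The `docquery` names used inside `run_with_docquery` (`pipeline("document-question-answering")`, `document.load_document`, and `doc.context`) are recalled from the project's README rather than stated on this page, so treat them as assumptions to verify against the installed version. `ask_all` and `run_with_docquery` are hypothetical helper names introduced only for illustration.

```python
from typing import Any, Callable, Dict, Iterable


def ask_all(qa: Callable[..., Any], context: Dict[str, Any],
            questions: Iterable[str]) -> Dict[str, Any]:
    """Ask every question against the same document context.

    `qa` is any callable that accepts `question=` plus the context as
    keyword arguments. This helper is hypothetical, not part of docquery.
    """
    return {q: qa(question=q, **context) for q in questions}


def run_with_docquery(path: str, questions: Iterable[str]) -> Dict[str, Any]:
    """Load a document and query it with docquery.

    Assumption: the imported names and call shapes below follow the
    docquery README; check them against your installed version.
    """
    from docquery import document, pipeline  # imported lazily on purpose

    qa = pipeline("document-question-answering")
    doc = document.load_document(path)
    return ask_all(qa, doc.context, questions)
```

Passing the pipeline in as a plain callable keeps `ask_all` easy to exercise with a stub, without downloading a model.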

Quick Start & Requirements

Install: pip install docquery
OCR: Requires tesseract (install via brew install tesseract on macOS or apt install tesseract-ocr on Ubuntu).
Demo: Hugging Face Spaces, Colab Notebook
CLI Usage: docquery scan "Your question?" /path/to/document
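For scripted use, the CLI invocation above can be wrapped from Python. This is a minimal sketch, assuming the `docquery` and `tesseract` executables are on PATH after installation; the function names are hypothetical and only the `docquery scan "question" document` form shown above is used.

```python
import shutil
import subprocess
from typing import List


def build_scan_command(question: str, document: str) -> List[str]:
    """Return the argv for: docquery scan "<question>" <document>."""
    return ["docquery", "scan", question, document]


def missing_tools() -> List[str]:
    """List the required executables that are not found on PATH."""
    return [tool for tool in ("docquery", "tesseract")
            if shutil.which(tool) is None]


def run_scan(question: str, document: str) -> str:
    """Run the scan and return its stdout.

    Raises RuntimeError if a tool is missing, and CalledProcessError
    if the command exits with a non-zero status.
    """
    missing = missing_tools()
    if missing:
        raise RuntimeError(f"Missing executables: {', '.join(missing)}")
    result = subprocess.run(
        build_scan_command(question, document),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the question contains spaces or quotation marks.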

Highlighted Details

Supports querying documents via HTTP/HTTPS URLs.
Can classify documents using Hugging Face models (e.g., Donut) with the --classify flag.
Offers an optional [donut] extra (installed with pip install "docquery[donut]") that enables the Donut model for classification and VQA.
Processes documents locally for enhanced security and privacy.
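Because a document can be given either as a local path or as an HTTP/HTTPS URL, a calling script may want to tell the two apart before handing the reference on. The sketch below uses only the Python standard library; both function names are hypothetical, and it does not reflect how DocQuery itself detects URLs.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_remote_document(ref: str) -> bool:
    """Return True when `ref` is an HTTP or HTTPS URL with a host part."""
    parsed = urlparse(ref)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def normalize_document_ref(ref: str) -> str:
    """Pass URLs through unchanged; expand local paths to absolute paths."""
    if is_remote_document(ref):
        return ref
    return str(Path(ref).expanduser().resolve())
```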

Maintenance & Community

Status: Not actively maintained, but welcomes community contributions and discussions.
Community: Discord

Licensing & Compatibility

License: MIT.
Compatibility: Includes code adapted from Hugging Face's transformers library, which is Apache 2.0 licensed. Both licenses permit commercial use, but redistributions should retain the Apache 2.0 license and notices for the adapted code alongside the MIT terms.

Limitations & Caveats

The project is not actively maintained. It relies on pre-trained models and does not support training on custom data. Answers are returned as plain text only: it does not extract structured data such as tables, and values such as numbers and dates come back as strings rather than typed values. Input support is limited to images and PDFs.
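Since answers arrive as strings, callers that need typed values have to convert them afterwards. The following is a minimal post-processing sketch, not part of DocQuery: the helper names are hypothetical, the amount parser assumes US-style separators (comma for thousands, dot for decimals), and the list of date layouts is an assumed, non-exhaustive example.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Date layouts to try, in order (assumed examples; extend as needed).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse a string such as "$1,234.50" into a Decimal, or None on failure."""
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_date(text: str) -> Optional[date]:
    """Try each known layout and return the first match, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` on failure lets the caller keep the original string when an answer does not match any expected format.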

Health Check

Last Commit

2 years ago

Responsiveness

1 day

Star History

1 star in the last 30 days