docquery by impira

Document query engine for extracting information from documents

created 3 years ago
1,771 stars

Top 24.8% on sourcepulse

Project Summary

DocQuery is a library and CLI tool for extracting information from documents (PDFs and images) using LLMs. It is aimed at users who need to query and analyze document content: you ask a question of a document and get an answer back, which enables automated data extraction for use cases such as invoice processing and contract analysis.

How It Works

DocQuery uses a pre-trained LayoutLM model fine-tuned for question answering on documents. Because LayoutLM takes the position of text on the page into account alongside the text itself, the approach is well suited to visual question answering (VQA) on semi-structured and unstructured documents. The model is fine-tuned on the SQuAD2.0 and DocVQA datasets, so it answers questions using both document content and layout.

Quick Start & Requirements

  • Install: pip install docquery
  • OCR: Requires tesseract (install via brew install tesseract on macOS or apt install tesseract-ocr on Ubuntu).
  • Demo: Hugging Face Spaces, Colab Notebook
  • CLI Usage: docquery scan "Your question?" /path/to/document
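
The CLI form above can also be driven from a script. The following is a minimal sketch, not part of DocQuery itself: the helper names build_scan_command and run_scan are hypothetical, and the only thing taken from the project is the documented invocation docquery scan "Your question?" /path/to/document. The document argument may be a local path or an HTTP/HTTPS URL. The sketch returns the CLI's raw stdout and makes no assumption about its format.

```python
import shutil
import subprocess
from typing import List, Optional


def build_scan_command(question: str, document: str) -> List[str]:
    """Build the argv for: docquery scan "<question>" <document>.

    Hypothetical helper. `document` may be a local path or an HTTP/HTTPS URL.
    Passing a list (not a shell string) avoids quoting problems in questions.
    """
    if not question.strip():
        raise ValueError("question must not be empty")
    if not document.strip():
        raise ValueError("document must not be empty")
    return ["docquery", "scan", question, document]


def run_scan(question: str, document: str) -> Optional[str]:
    """Run the CLI and return its raw stdout.

    Returns None when the docquery executable is not on PATH, so the
    sketch degrades gracefully on machines without DocQuery installed.
    """
    if shutil.which("docquery") is None:
        return None
    result = subprocess.run(
        build_scan_command(question, document),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_scan(...) to actually execute it.
    print(build_scan_command("What is the invoice number?", "/path/to/document"))
```

Keeping command construction separate from execution makes the wrapper easy to check without having DocQuery or tesseract installed.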

Highlighted Details

  • Supports querying documents via HTTP/HTTPS URLs.
  • Can classify documents using Hugging Face models (e.g., Donut) with the --classify flag.
  • Offers an optional [donut] extra for using the Donut model for classification and VQA.
  • Processes documents locally for enhanced security and privacy.

Maintenance & Community

  • Status: Not actively maintained, but welcomes community contributions and discussions.
  • Community: Discord

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Includes code adapted from Hugging Face's transformers library, which is Apache 2.0 licensed. Both licenses are permissive and generally compatible with commercial use, but the adapted code remains subject to the Apache 2.0 terms (for example, retaining license and attribution notices).

Limitations & Caveats

The project is not actively maintained. It relies on pre-trained models and does not support training on custom data. Output is limited to scalar text: it does not extract structured data such as tables, and it has no richer scalar types, so numbers and dates are returned as strings. Only images and PDFs are supported.
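
Since numbers and dates come back as strings, callers that need typed values have to convert them downstream. The following is a minimal sketch of such post-processing using only the Python standard library. It is not part of DocQuery: parse_amount, parse_date, and DATE_FORMATS are hypothetical names, the amount parser assumes US-style separators (comma for thousands, dot for decimals), and the list of date formats is an illustrative assumption to adapt to your documents.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed formats; extend to match the documents you process.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_amount(answer: str) -> Optional[Decimal]:
    """Convert an answer string such as "$1,234.50" to a Decimal.

    Assumes US-style separators. Returns None when no number can be parsed.
    """
    # Drop currency symbols, thousands separators, and any other text.
    cleaned = re.sub(r"[^\d.\-]", "", answer)
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_date(answer: str) -> Optional[date]:
    """Try each assumed format in order; return None if none matches."""
    text = answer.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    print(parse_amount("$1,234.50"))
    print(parse_date("03/04/2022"))
```

Returning None instead of raising lets a caller fall back to the original string when an answer does not look like a number or a date.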

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days
