docext by NanoNets

On-premises tool for document information extraction and benchmarking

created 4 months ago
1,568 stars

Top 27.2% on sourcepulse

Project Summary

docext is an on-premises toolkit for extracting information from unstructured documents and for benchmarking that extraction, aimed at developers and researchers in Intelligent Document Processing (IDP). It uses Vision-Language Models (VLMs) to extract structured data, including tables, from documents such as invoices and passports, without a separate OCR step.

How It Works

Docext uses VLMs to interpret a document's content and layout directly, so it can extract both key fields and complex tables without a traditional OCR stage. Because the model reads layout and content together, it can take a document's visual structure into account during extraction. Each prediction comes with a confidence score.
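
As a rough illustration of how per-prediction confidence scores might be used downstream, the sketch below routes low-confidence fields to manual review. The `ExtractedField` class, its attribute names, the sample values, and the 0.8 threshold are all assumptions made for this example; they are not docext's actual output format or API.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """Hypothetical container for one extracted field (not docext's schema)."""
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def split_by_confidence(fields, threshold=0.8):
    """Return (accepted, needs_review) lists based on the confidence threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up values, purely for illustration.
fields = [
    ExtractedField("invoice_number", "INV-001", 0.97),
    ExtractedField("total_amount", "1,250.00", 0.62),
]
accepted, needs_review = split_by_confidence(fields)
print("accepted:", [f.name for f in accepted])
print("needs review:", [f.name for f in needs_review])
```

A threshold like this is a policy choice: a higher value sends more fields to review, a lower one accepts more predictions unchecked.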

Quick Start & Requirements

  • Install: pip install docext
  • Prerequisites: Linux or macOS. Specific VLM dependencies are managed by the toolkit.
  • Resources: Hardware able to run the chosen VLM locally; requirements scale with model size.
  • Docs: Full feature guide

Highlighted Details

  • On-premises deployment for data privacy and control.
  • Supports custom field definitions and pre-built templates for invoices and passports.
  • Includes a REST API for integration and supports multi-page documents.
  • Features an Intelligent Document Processing Leaderboard for evaluating VLM performance across various IDP tasks.
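
To make the idea of custom field definitions layered on a pre-built template concrete, here is a minimal sketch. `FieldSpec`, `INVOICE_TEMPLATE`, `with_custom_fields`, and every field name in it are hypothetical and chosen only for illustration; docext's real template format and API may look quite different.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one item to extract (not docext's schema)."""
    name: str
    description: str
    kind: str = "field"  # "field" for a key-value pair, "table" for a table column


# Assumed stand-in for a pre-built invoice template.
INVOICE_TEMPLATE = [
    FieldSpec("invoice_number", "Unique identifier printed on the invoice"),
    FieldSpec("invoice_date", "Date the invoice was issued"),
    FieldSpec("line_item_amount", "Amount for each line item", kind="table"),
]


def with_custom_fields(template, extra):
    """Return a new list: the template plus extra specs, rejecting duplicate names."""
    seen = {spec.name for spec in template}
    merged = list(template)
    for spec in extra:
        if spec.name in seen:
            raise ValueError(f"duplicate field name: {spec.name}")
        seen.add(spec.name)
        merged.append(spec)
    return merged


custom = [FieldSpec("po_number", "Purchase order number referenced on the invoice")]
fields = with_custom_fields(INVOICE_TEMPLATE, custom)
print(json.dumps([asdict(spec) for spec in fields], indent=2))
```

Keeping the template immutable and returning a merged copy means the same base template can be reused across document types without one custom field leaking into another.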

Maintenance & Community

Developed by Nanonets, a company specializing in document AI. Contributions are welcomed via issues and pull requests.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The toolkit focuses on VLM-based extraction and does not include traditional OCR capabilities. Performance depends on the underlying VLM and on the complexity of the document's structure.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

1,476 stars in the last 90 days
