docext by NanoNets

On-premises tool for document information extraction and benchmarking

Created 11 months ago

1,858 stars

Top 22.9% on SourcePulse

Project Summary

docext is an on-premises toolkit for extracting information from unstructured documents and for benchmarking that extraction. It targets developers and researchers working in Intelligent Document Processing (IDP). It uses Vision-Language Models (VLMs) to extract structured data, including tables, from documents such as invoices and passports, without a separate OCR step.

How It Works

docext uses VLMs to interpret both the content and the layout of a document, so it can extract key fields and complex tables without relying on traditional OCR. Each prediction is returned with a confidence score.
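As a rough illustration of how per-field confidence scores could be used downstream, the sketch below splits extracted fields into an accepted group and a needs-review group. The `ExtractedField` class, the `split_by_confidence` helper, the 0.8 threshold, and the sample values are all hypothetical; none of them are part of docext's API or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One extracted key field (hypothetical shape, not docext's output format)."""
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def split_by_confidence(fields, threshold=0.8):
    """Return (accepted, needs_review) lists split at the given threshold."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up sample values for illustration only.
sample = [
    ExtractedField("invoice_number", "INV-001", 0.97),
    ExtractedField("total_amount", "1250.00", 0.62),
]
accepted, needs_review = split_by_confidence(sample)
```

A threshold like this lets low-confidence fields be routed to a human reviewer while the rest pass through automatically; the right cutoff depends on the model and the document type.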

Quick Start & Requirements

Install: pip install docext
Prerequisites: Linux or macOS. Specific VLM dependencies are managed by the toolkit.
Resources: Requires sufficient hardware to run VLMs locally.
Docs: Full feature guide
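
After installing, one way to confirm that the package is present is to query the installed distribution metadata from Python. This is a generic standard-library check, not a docext feature; the `installed_version` helper is a hypothetical name, and `docext` is the distribution name taken from the install command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "docext" is the distribution name used in the pip install command.
print(installed_version("docext") or "docext is not installed")
```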

Highlighted Details

On-premises deployment for data privacy and control.
Supports custom field definitions and pre-built templates for invoices and passports.
Includes a REST API for integration and supports multi-page documents.
Features an Intelligent Document Processing Leaderboard for evaluating VLM performance across various IDP tasks.
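
To make the benchmarking idea concrete, here is a minimal sketch of a field-level exact-match score for one document. It is a simplified stand-in and not the metric used by the leaderboard; the `field_accuracy` function, its normalization rule, and the sample data are assumptions made for illustration.

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected fields whose predicted value matches exactly
    after trimming whitespace and ignoring case."""
    if not expected:
        return 0.0

    def normalize(value):
        return str(value).strip().lower()

    correct = sum(
        1
        for name, truth in expected.items()
        if name in predicted and normalize(predicted[name]) == normalize(truth)
    )
    return correct / len(expected)


# Made-up ground truth and model output for a single invoice.
expected = {"invoice_number": "INV-001", "currency": "USD", "total_amount": "1250.00"}
predicted = {"invoice_number": "inv-001 ", "currency": "USD", "total_amount": "1205.00"}
score = field_accuracy(predicted, expected)  # two of three fields match
```

Exact match is strict: the transposed digits in `total_amount` count as a full miss. Real evaluations often use softer measures, such as edit-distance-based similarity, for free-text fields.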

Maintenance & Community

Developed by Nanonets, a company specializing in document AI. Contributions are welcomed via issues and pull requests.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The toolkit focuses on VLM-based extraction and does not include traditional OCR capabilities. Performance depends on the underlying VLM and on the complexity of the document structure.

Health Check

Last Commit

6 months ago

Responsiveness

1 day

Star History

13 stars in the last 30 days