documind by DocumindHQ

Open-source platform for structured data extraction from documents

Created 1 year ago

1,463 stars

Top 27.6% on SourcePulse

1 Expert Loves This Project

Elvis Saravia

Founder of DAIR.AI

Project Summary

Documind is an open-source platform for AI-powered structured data extraction from documents, primarily PDFs. It targets developers and businesses that need to automate information retrieval, and offers custom schema definition, Markdown conversion, and support for both OpenAI and local LLMs (Llava, Llama3.2-vision). The platform aims to simplify document processing by converting unstructured document content into structured JSON.

How It Works

Documind uses Large Language Models (LLMs) to parse document content and extract data according to user-defined schemas. The core approach is to send the document text and a schema definition to an LLM, which returns structured JSON. Documind can also auto-generate a schema from the document content, and it ships pre-defined templates for common document types such as invoices and bank statements.
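The schema-driven flow described above can be sketched in a few lines. This is a minimal illustration, not Documind's actual API: the `SchemaField` shape and the `buildPrompt` and `parseReply` helpers are hypothetical names introduced here, and the LLM reply is a hard-coded stand-in so that no model is called.

```typescript
// Hypothetical field description; the real Documind schema format may differ.
type FieldType = "string" | "number" | "boolean";

interface SchemaField {
  name: string;
  type: FieldType;
  description: string;
}

// Build the instruction text that would accompany the document content.
function buildPrompt(schema: SchemaField[], documentText: string): string {
  const fields = schema
    .map((f) => `- ${f.name} (${f.type}): ${f.description}`)
    .join("\n");
  return [
    "Extract the following fields and reply with a single JSON object.",
    fields,
    "Document:",
    documentText,
  ].join("\n");
}

// Parse the reply and check that every schema field is present with the declared type.
function parseReply(schema: SchemaField[], reply: string): Record<string, unknown> {
  const data: unknown = JSON.parse(reply);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("Reply is not a JSON object");
  }
  const record = data as Record<string, unknown>;
  for (const field of schema) {
    if (typeof record[field.name] !== field.type) {
      throw new Error(`Field "${field.name}" is missing or not a ${field.type}`);
    }
  }
  return record;
}

const invoiceSchema: SchemaField[] = [
  { name: "invoiceNumber", type: "string", description: "Invoice identifier" },
  { name: "total", type: "number", description: "Total amount due" },
];

const promptText = buildPrompt(invoiceSchema, "Invoice INV-001, total due 42.50");
// Stand-in for the LLM's reply; a real run would send promptText to a model.
const parsed = parseReply(invoiceSchema, '{"invoiceNumber": "INV-001", "total": 42.5}');
```

Validating the reply against the same schema that produced the prompt is one way to catch a model that omits a field or returns the wrong type before the data reaches downstream code.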

Quick Start & Requirements

Install via npm: npm install documind
System Dependencies: Ghostscript, GraphicsMagick (install via brew on macOS or apt-get on Debian/Ubuntu).
Node.js & NPM: v18+ required.
Environment: Requires a .env file containing OPENAI_API_KEY=your_openai_api_key.
Documentation: https://documind.org/docs/
Hosted version (beta): https://documind.org/
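A small preflight check can surface two of the requirements above (Node.js v18+ and the OPENAI_API_KEY variable) before any extraction is attempted. The sketch below is an assumption-laden helper, not part of Documind: `checkPrerequisites` is a hypothetical name, it assumes the .env values have already been loaded into the environment, and it does not verify that Ghostscript or GraphicsMagick are installed.

```typescript
// Hypothetical helper: returns a list of human-readable problems.
// An empty list means both checks passed.
function checkPrerequisites(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];

  // Node reports versions like "v18.19.0"; parseInt stops at the first dot,
  // so this yields the major version.
  const major = parseInt(nodeVersion.replace(/^v/, ""), 10);
  if (isNaN(major) || major < 18) {
    problems.push(`Node.js v18+ is required, found ${nodeVersion}`);
  }

  const key = env["OPENAI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    problems.push("OPENAI_API_KEY is not set");
  }

  return problems;
}

// Example inputs; in a Node.js script these would typically be
// process.version and process.env.
const okResult = checkPrerequisites("v18.19.0", { OPENAI_API_KEY: "sk-example" });
const badResult = checkPrerequisites("v16.20.2", {});
```

Passing the version string and environment in as arguments keeps the function easy to test without touching the real process state.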

Highlighted Details

Supports extraction from PDF, DOCX, PNG, JPG, TXT, and HTML.
Offers auto-generated schemas and pre-defined templates for common document types.
Integrates with OpenAI and local LLMs like Llava and Llama3.2-vision.
Outputs structured JSON and can convert documents to Markdown.
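Callers may want to reject unsupported inputs before handing a file to the extractor. The following guard is a sketch built only from the format list above; `SUPPORTED_EXTENSIONS` and `isSupportedFile` are hypothetical names, and matching on the file extension is an assumption made here rather than how Documind itself detects file types.

```typescript
// Extensions taken from the supported-format list above.
const SUPPORTED_EXTENSIONS: string[] = ["pdf", "docx", "png", "jpg", "txt", "html"];

// Returns true when the file name ends in one of the listed extensions
// (case-insensitive). Names without an extension are rejected.
function isSupportedFile(fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1 || dot === fileName.length - 1) {
    return false;
  }
  const ext = fileName.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.indexOf(ext) !== -1;
}
```

For example, `isSupportedFile("statement.PDF")` is accepted, while `isSupportedFile("notes.odt")` and `isSupportedFile("README")` are rejected.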

Maintenance & Community

The project is built on top of Zerox. Contributions are welcomed via pull requests. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

Licensed under AGPL v3.0. The README also mentions an MIT license from Zerox in the core folder and root license file, which may require clarification regarding combined usage and compatibility for commercial or closed-source applications.

Limitations & Caveats

The AGPL v3.0 license has strong copyleft provisions that may affect integration into proprietary software. Because the README references both AGPL v3.0 and the MIT license from Zerox, it is worth confirming which license covers which part of the code before adopting the project. Planned features such as image extraction and advanced formatters are not yet implemented.

Health Check

Last Commit

9 months ago

Responsiveness

1 day

Star History

4 stars in the last 30 days