documind  by DocumindHQ

Open-source platform for structured data extraction from documents

created 8 months ago
1,358 stars

Top 30.2% on sourcepulse

Project Summary

Documind is an open-source platform for AI-powered structured data extraction from documents, primarily PDFs. It targets developers and businesses that need to automate information retrieval, offering custom schema definition, Markdown conversion, and support for both OpenAI and local LLMs (Llava, Llama3.2-vision). The platform aims to simplify document processing by converting unstructured document content into structured JSON.

How It Works

Documind utilizes Large Language Models (LLMs) to parse document content and extract data according to user-defined schemas. The core approach involves sending document text and a schema definition to an LLM, which then returns structured JSON. It supports auto-generating schemas based on document content and offers pre-defined templates for common document types like invoices and bank statements, streamlining the extraction process.
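The schema-driven flow described above can be sketched without calling any real service. The following is a minimal illustration, not Documind's actual API: the `FieldSpec` type, the `invoiceSchema` fields, and the `buildPrompt` and `parseAndValidate` helpers are all hypothetical names introduced here, and Documind's real schema format and prompt may differ.

```typescript
// Hypothetical field description; Documind's real schema format may differ.
type FieldSpec = {
  name: string;
  type: "string" | "number" | "boolean";
  description: string;
};

// Illustrative schema for an invoice (field names are assumptions).
const invoiceSchema: FieldSpec[] = [
  { name: "invoiceNumber", type: "string", description: "Identifier printed on the invoice" },
  { name: "total", type: "number", description: "Total amount due" },
];

// Build the instruction that would accompany the document text in an LLM request.
function buildPrompt(schema: FieldSpec[], documentText: string): string {
  const fields = schema
    .map((f) => `- ${f.name} (${f.type}): ${f.description}`)
    .join("\n");
  return `Extract these fields and reply with JSON only:\n${fields}\n\nDocument:\n${documentText}`;
}

// Parse the model's reply and check that every schema field is present
// with the declared primitive type; throws on the first mismatch.
function parseAndValidate(schema: FieldSpec[], reply: string): Record<string, unknown> {
  const data = JSON.parse(reply) as Record<string, unknown>;
  for (const field of schema) {
    if (typeof data[field.name] !== field.type) {
      throw new Error(`Field "${field.name}" is missing or not a ${field.type}`);
    }
  }
  return data;
}
```

In this sketch, the string returned by `buildPrompt` would be sent to whichever LLM is configured, and the model's reply would be passed to `parseAndValidate` before the JSON is used downstream. Validating the reply against the same schema that produced the prompt is one way to catch a model that omits a field or returns a number as text.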

Quick Start & Requirements

  • Install via npm: npm install documind
  • System Dependencies: Ghostscript, GraphicsMagick (install via brew on macOS or apt-get on Debian/Ubuntu).
  • Node.js & npm: Node.js v18 or later required.
  • Environment: Requires an .env file with OPENAI_API_KEY=your_openai_api_key.
  • Documentation: https://documind.org/docs/
  • Hosted Version (Beta): https://documind.org/
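The Node.js and environment requirements listed above can be checked before running an extraction. The following is a small sketch under stated assumptions: `checkEnvironment` is a hypothetical helper, not part of Documind, and it assumes the `.env` file has already been loaded into the process environment (for example by a loader such as dotenv).

```typescript
// Returns a list of problems; an empty list means the prerequisites look satisfied.
// Hypothetical helper, not part of Documind.
function checkEnvironment(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];

  // Node's process.version looks like "v18.17.0"; extract the major component.
  const match = /^v?(\d+)/.exec(nodeVersion);
  const major = match ? Number(match[1]) : NaN;
  if (Number.isNaN(major) || major < 18) {
    problems.push(`Node.js 18 or newer is required, found ${nodeVersion}`);
  }

  // The key is expected to come from the .env file described above.
  const key = env["OPENAI_API_KEY"];
  if (!key || key.trim() === "") {
    problems.push("OPENAI_API_KEY is not set");
  }

  return problems;
}
```

In a Node.js script this would typically be called as `checkEnvironment(process.version, process.env)`, printing the returned messages and exiting early if the list is not empty. The function takes its inputs as parameters rather than reading globals so that it can be exercised in isolation.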

Highlighted Details

  • Supports extraction from PDF, DOCX, PNG, JPG, TXT, and HTML.
  • Offers auto-generated schemas and pre-defined templates for common document types.
  • Integrates with OpenAI and local LLMs like Llava and Llama3.2-vision.
  • Outputs structured JSON and can convert documents to Markdown.

Maintenance & Community

The project is built on top of Zerox. Contributions are welcome via pull requests. The README does not list community channels (Discord/Slack) or a roadmap.

Licensing & Compatibility

Licensed under AGPL v3.0. The README also mentions an MIT license from Zerox, referenced in the core folder and the root license file. How the two licenses combine is not spelled out, so their compatibility with commercial or closed-source use may need clarification.

Limitations & Caveats

The AGPL v3.0 license has strong copyleft provisions that may impact integration into proprietary software. Because the README references both AGPL v3.0 and Zerox's MIT license, review which terms apply to which parts of the code before adopting it. Upcoming features such as image extraction and advanced formatters are not yet implemented.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 61 stars in the last 90 days
