documind  by DocumindHQ

Open-source platform for structured data extraction from documents

Created 10 months ago
1,417 stars

Top 28.7% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

Documind is an open-source platform designed for AI-powered structured data extraction from documents, primarily PDFs. It caters to developers and businesses needing to automate information retrieval, offering features like custom schema definition, Markdown conversion, and support for both OpenAI and local LLMs (Llava, Llama3.2-vision). The platform aims to simplify document processing by converting unstructured text into usable JSON formats.

How It Works

Documind utilizes Large Language Models (LLMs) to parse document content and extract data according to user-defined schemas. The core approach involves sending document text and a schema definition to an LLM, which then returns structured JSON. It supports auto-generating schemas based on document content and offers pre-defined templates for common document types like invoices and bank statements, streamlining the extraction process.

Quick Start & Requirements

  • Install via npm: npm install documind
  • System Dependencies: Ghostscript, GraphicsMagick (install via brew on macOS or apt-get on Debian/Ubuntu).
  • Node.js & NPM: v18+ required.
  • Environment: Requires an .env file with OPENAI_API_KEY=your_openai_api_key.
  • Documentation: https://documind.org/docs/
  • Hosted Version Beta: https://documind.org/

Highlighted Details

  • Supports extraction from PDF, DOCX, PNG, JPG, TXT, and HTML.
  • Offers auto-generated schemas and pre-defined templates for common document types.
  • Integrates with OpenAI and local LLMs like Llava and Llama3.2-vision.
  • Outputs structured JSON and can convert documents to Markdown.

Maintenance & Community

The project is built on top of Zerox. Contributions are welcomed via pull requests. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

Licensed under AGPL v3.0. The README also mentions an MIT license from Zerox in the core folder and root license file, which may require clarification regarding combined usage and compatibility for commercial or closed-source applications.

Limitations & Caveats

The AGPL v3.0 license has strong copyleft provisions that may impact integration into proprietary software. The dual licensing mention requires careful review to understand full implications. Upcoming features like image extraction and advanced formatters are not yet implemented.

Health Check
Last Commit

4 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
23 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Pawel Garbacki Pawel Garbacki(Cofounder of Fireworks AI), and
1 more.

MinerU by opendatalab

1.2%
44k
PDF extraction tool for converting PDFs to Markdown and JSON
Created 1 year ago
Updated 1 day ago
Feedback? Help us improve.