Tool for structured data extraction from unstructured documents
DocAI is a Python library designed for structured information extraction from unstructured documents, targeting developers and researchers working with document analysis and data retrieval. It leverages advanced language models to parse PDFs and extract specific data points, outputting them in a structured, Pydantic-compatible format.
How It Works
The system uses LangChain for orchestration, OpenAI's GPT-4o model for natural language understanding, and Answer.AI's Byaldi for document parsing. It builds an index from a specified folder of PDFs, then queries that index to extract predefined data structures, such as loss history or basic application details.
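The last step, turning a model reply into a predefined structure, can be sketched as follows. This is a minimal illustration and not the project's code: it uses a standard-library dataclass as a stand-in for a Pydantic model, the reply string is hard-coded rather than returned by GPT-4o, and the `LossHistory` field names and the `parse_reply` helper are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class LossHistory:
    # Hypothetical fields; the project's real schema is not shown in this summary.
    policy_year: int
    claim_count: int
    total_paid: float


def parse_reply(reply_text, schema):
    """Parse a JSON object and coerce each schema field to its declared type."""
    data = json.loads(reply_text)
    kwargs = {}
    for field in fields(schema):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        # field.type is the annotated class (int, float), so calling it coerces the value.
        kwargs[field.name] = field.type(data[field.name])
    return schema(**kwargs)


# Stand-in for a model reply; a real reply would come from the language model.
reply = '{"policy_year": 2021, "claim_count": "3", "total_paid": 1250.5}'
record = parse_reply(reply, LossHistory)
print(record)
```

Rejecting replies with missing fields up front keeps malformed model output from propagating into downstream records, which is the same role schema validation plays in a Pydantic-based pipeline.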
Quick Start & Requirements
Install dependencies: poetry install
Set the environment variables OPENAI_API_KEY and HF_TOKEN.
Build the index: python scripts/build_index.py --folder "pdfs/" --index_name "application"
Run extraction: python scripts/extract.py
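Before running the scripts, it may help to confirm that the two environment variables are set and that the PDF folder is not empty. The check below is a hedged sketch that is not part of the project: the helper names `missing_keys` and `list_pdfs` are hypothetical, and only the variable names and the `pdfs/` folder come from the setup steps above.

```python
import os
from pathlib import Path

# Variable names taken from the setup steps; everything else here is illustrative.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_keys(env):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    pdfs = list_pdfs("pdfs/")
    print(f"Found {len(pdfs)} PDF file(s) in pdfs/")
```

Running this first surfaces a missing key or an empty folder before any API call is made.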
Highlighted Details
Extracts predefined structures such as LossHistory and Application details.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project requires specific API keys for OpenAI and Hugging Face, and relies on Python 3.10.6, potentially limiting compatibility with other environments. The absence of a specified license raises concerns for commercial use.
10 months ago
Inactive