PragmaticMachineLearning
Tool for structured data extraction from unstructured documents
Top 84.9% on SourcePulse
DocAI is a Python library designed for structured information extraction from unstructured documents, targeting developers and researchers working with document analysis and data retrieval. It leverages advanced language models to parse PDFs and extract specific data points, outputting them in a structured, Pydantic-compatible format.
How It Works
The system uses LangChain for orchestration, OpenAI's GPT-4o model for natural language understanding, and Answer.AI's Byaldi for document parsing. It builds an index from a specified folder of PDFs, then queries that index to extract predefined data structures, such as loss history or basic application details.
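To make the idea of a predefined, Pydantic-compatible data structure concrete, here is a minimal sketch of what a loss-history schema and its validation step might look like. The field names (date_of_loss, description, amount_paid) and the parse_loss_history helper are hypothetical and chosen for illustration only; the project's real schema is not shown in this summary.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LossEvent(BaseModel):
    # Hypothetical fields: the project's actual schema may differ.
    date_of_loss: str
    description: str
    amount_paid: Optional[float] = None


class LossHistory(BaseModel):
    # A loss history is modeled here as a list of individual loss events.
    events: List[LossEvent]


def parse_loss_history(raw_json: str) -> Optional[LossHistory]:
    """Validate a JSON string against the schema; return None if it does not fit."""
    try:
        return LossHistory(**json.loads(raw_json))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # Malformed JSON, missing or mistyped fields, or a non-object payload.
        return None


# Example input standing in for a language model's structured response.
sample = (
    '{"events": [{"date_of_loss": "2023-04-01", '
    '"description": "Water damage", "amount_paid": 12500.0}]}'
)
history = parse_loss_history(sample)
```

Validating the model's output against a schema like this means malformed or incomplete responses are rejected before they reach downstream code, rather than surfacing later as missing keys.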
Quick Start & Requirements
Install dependencies: poetry install
Required environment variables: OPENAI_API_KEY, HF_TOKEN.
Build the index: python scripts/build_index.py --folder "pdfs/" --index_name "application"
Run extraction: python scripts/extract.py
Highlighted Details
Extracts LossHistory and Application details.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project requires specific API keys for OpenAI and Hugging Face, and relies on Python 3.10.6, potentially limiting compatibility with other environments. The absence of a specified license raises concerns for commercial use.
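Since the requirements above are easy to miss, a small preflight check run before the scripts can report them up front. The following is a sketch only: the preflight_problems helper is hypothetical and not part of the project, and it compares only the major and minor Python version (3.10) rather than the exact 3.10.6 patch release.

```python
import os
import sys
from typing import List, Mapping, Optional, Sequence

# Names taken from the requirements listed in this summary.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "HF_TOKEN")
# Assumption: matching major.minor is enough; the summary names 3.10.6.
EXPECTED_PYTHON = (3, 10)


def preflight_problems(
    environ: Optional[Mapping[str, str]] = None,
    version: Optional[Sequence[int]] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    env = os.environ if environ is None else environ
    ver = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])

    # Treat unset and empty variables the same way.
    problems = [
        f"missing environment variable: {name}"
        for name in REQUIRED_ENV_VARS
        if not env.get(name)
    ]
    if ver != EXPECTED_PYTHON:
        problems.append(
            f"Python {ver[0]}.{ver[1]} detected; the project targets 3.10"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print(problem)
```

The environment and version are passed in as optional arguments so the check can be exercised without changing the real process environment.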
1 year ago
Inactive