docai  by PragmaticMachineLearning

Tool for structured data extraction from unstructured documents

Created 1 year ago
317 stars

Top 85.2% on SourcePulse

Project Summary

DocAI is a Python library for extracting structured information from unstructured documents, aimed at developers and researchers working on document analysis and data retrieval. It uses a large language model to read PDFs and pull out specific data points, returning them as Pydantic models.

How It Works

The system uses LangChain for orchestration, OpenAI's GPT-4o for language understanding, and Answer.AI's Byaldi for document parsing. It first builds an index from a specified folder of PDFs, then queries that index to extract predefined data structures, such as loss history or basic application details.

Quick Start & Requirements

  • Install: poetry install
  • Prerequisites: Python 3.10.6, OPENAI_API_KEY, HF_TOKEN.
  • Usage:
    • Build index: python scripts/build_index.py --folder "pdfs/" --index_name "application"
    • Extract data: python scripts/extract.py
  • Documentation: none linked; the usage scripts above serve as the main reference.
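
Because both scripts depend on the prerequisites listed above, a small pre-flight check can catch a missing key or a mismatched interpreter before an index build starts. The following is a minimal sketch, not part of DocAI: the helper names `missing_env_vars` and `python_matches` are hypothetical, and comparing only the major and minor version is an assumption, since the listing names the exact patch release 3.10.6.

```python
import os
import sys

# Names taken from the prerequisites above.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "HF_TOKEN")
# Assumption: matching major.minor is enough; the listing states 3.10.6.
EXPECTED_PYTHON = (3, 10)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def python_matches(version_info=None):
    """Return True when the interpreter's major.minor version is 3.10."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == EXPECTED_PYTHON


# Report problems without exiting, so the check can be imported or run inline.
missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
if not python_matches():
    print("Expected Python 3.10.x, found " + sys.version.split()[0])
```

Running this before `scripts/build_index.py` prints one line per unmet prerequisite and nothing when the environment is ready.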

Highlighted Details

  • Uses OpenAI's GPT-4o to perform the extraction.
  • Uses LangChain to orchestrate the workflow.
  • Outputs structured data using Pydantic models for easy integration.
  • Demonstrates extraction of specific data like LossHistory and Application details.
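
To illustrate what Pydantic-based output looks like in practice, here is a minimal sketch of a schema and the validation step. The field names (`year`, `description`, `amount`, `events`) and the `LossEvent` class are hypothetical; the listing names a `LossHistory` structure but does not show its fields, so the repository's actual models may differ. The JSON string stands in for a model response.

```python
import json
from typing import List, Optional

from pydantic import BaseModel


class LossEvent(BaseModel):
    """One loss record. Field names are illustrative, not taken from DocAI."""
    year: int
    description: str
    amount: Optional[float] = None


class LossHistory(BaseModel):
    """Hypothetical container for the loss records found in a document."""
    events: List[LossEvent]


# Stand-in for the JSON a language model might return for one document.
raw = '{"events": [{"year": 2021, "description": "Water damage", "amount": 12500.0}]}'

# Constructing the model validates types, so malformed output fails loudly.
history = LossHistory(**json.loads(raw))
print(history.events[0].description)
```

Validating at construction time means downstream code can rely on typed fields rather than checking raw dictionaries by hand.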

Maintenance & Community

  • Project maintained by PragmaticMachineLearning.
  • No explicit community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not specify a license.

Limitations & Caveats

The project requires API keys for both OpenAI and Hugging Face and targets Python 3.10.6, which may limit compatibility with other environments. Because no license is specified, the terms for commercial use are unclear.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
