docai by PragmaticMachineLearning

Tool for structured data extraction from unstructured documents

created 10 months ago
316 stars

Top 86.7% on sourcepulse

Project Summary

DocAI is a Python library for extracting structured information from unstructured documents, aimed at developers and researchers working on document analysis and data retrieval. It uses a large language model to read PDFs and pull out specific data points, returning them as Pydantic models.

How It Works

The system uses LangChain for orchestration, OpenAI's GPT-4o model for language understanding, and Answer.AI's Byaldi for document parsing. It first builds an index from a specified folder of PDFs, then queries that index to fill predefined data structures such as loss history or basic application details.

Quick Start & Requirements

  • Install: poetry install
  • Prerequisites: Python 3.10.6, an OpenAI API key (OPENAI_API_KEY), and a Hugging Face token (HF_TOKEN).
  • Usage:
    • Build index: python scripts/build_index.py --folder "pdfs/" --index_name "application"
    • Extract data: python scripts/extract.py
  • Documentation: none linked; the usage scripts are the main reference.
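
Because both scripts depend on the prerequisites listed above, a small preflight check can catch a missing key before any API call is made. The following is a minimal sketch, not part of the repository: it assumes the two credentials are read from environment variables with the names shown, and the helper names missing_env_vars and python_matches are hypothetical.

```python
import os
import sys

# Names taken from the prerequisites above; reading them from the
# environment is an assumption about how the scripts consume them.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "HF_TOKEN")

# The summary lists Python 3.10.6; this sketch only compares major.minor.
EXPECTED_PYTHON = (3, 10)


def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


def python_matches(version_info=sys.version_info):
    """Return True when the interpreter's major.minor matches the expected one."""
    return tuple(version_info[:2]) == EXPECTED_PYTHON
```

Calling missing_env_vars() with no arguments inspects the current process environment; an empty list means both credentials are present. Passing a plain dict instead makes the check easy to exercise without touching real secrets.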

Highlighted Details

  • Uses OpenAI's GPT-4o for extraction.
  • Uses LangChain to orchestrate the workflow.
  • Outputs structured data using Pydantic models for easy integration.
  • Demonstrates extraction of specific data like LossHistory and Application details.
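
To illustrate what Pydantic-based output can look like, here is a minimal sketch of a schema and a validation helper. Only the name LossHistory comes from the summary above; the LossEvent model, every field name, and the parse_loss_history helper are hypothetical, and the schema actually defined in the repository may differ.

```python
from datetime import date
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LossEvent(BaseModel):
    # Hypothetical fields chosen for illustration only.
    date_of_loss: date
    description: str
    amount_paid: Optional[float] = None


class LossHistory(BaseModel):
    # Model name from the summary; the field layout is an assumption.
    events: List[LossEvent]


def parse_loss_history(payload: dict) -> Optional[LossHistory]:
    """Validate a model-produced dict; return None if it does not fit the schema."""
    try:
        return LossHistory(**payload)
    except ValidationError:
        return None
```

Validating the language model's output against a schema like this means downstream code receives typed values (for example, a real date object) or an explicit failure, rather than free-form text.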

Maintenance & Community

  • Project maintained by PragmaticMachineLearning.
  • No explicit community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not specify a license.

Limitations & Caveats

The project requires API keys for both OpenAI and Hugging Face and targets Python 3.10.6, which may limit compatibility with other environments. Because no license is specified, commercial use carries legal uncertainty.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
