ExtractThinker by enoch3712

Document intelligence library for LLMs, ORM-style interaction

created 1 year ago
1,313 stars

Top 31.1% on sourcepulse

1 Expert Loves This Project
Project Summary

ExtractThinker is a Python library designed for document intelligence, enabling users to extract and classify structured data from various document formats using Large Language Models (LLMs). It targets developers and researchers working with document processing workflows, offering an ORM-like interface for simplified interaction with LLMs and document data.

How It Works

ExtractThinker employs a modular architecture, inspired by LangChain, comprising Document Loaders for data ingestion, Extractors for LLM orchestration, Splitters for chunking, Contracts (Pydantic models) for defining output schemas, Classifications for routing, and Processes for managing workflows. This design facilitates specialized document processing, aiming for higher accuracy and ease of use compared to general-purpose frameworks.
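
To make those components concrete, here is a minimal sketch of a Contract and a pair of Classifications. The class names follow the terminology above, but the field and constructor names are drawn from typical usage and should be treated as assumptions rather than a verified API reference.

```python
from extract_thinker import Contract, Classification  # assumed top-level exports

# Contracts are Pydantic models that define the structured output schema.
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

class DriverLicenseContract(Contract):
    name: str
    license_number: str
    expiration_date: str

# Classifications pair a label and description with the contract to apply
# when a document is routed to that type (argument names are assumptions).
classifications = [
    Classification(
        name="Invoice",
        description="A billing document with line items and totals",
        contract=InvoiceContract,
    ),
    Classification(
        name="Driver License",
        description="A government-issued driver license",
        contract=DriverLicenseContract,
    ),
]
```

Splitters and Processes then operate over these pieces: a Splitter chunks multi-document files, and a Process wires loaders, classifications, and extraction into a single workflow.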

Quick Start & Requirements

  • Primary install: pip install extract_thinker
  • Prerequisites: Python 3.x and python-dotenv for loading environment variables. LLM API keys (e.g., OpenAI) are required for most functionality. A Tesseract OCR installation (and its path) may be needed for image-based documents.
  • Resources: LLM API costs, potential GPU for local models.
  • Documentation: examples directory in the repository (a minimal usage sketch follows this list).
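
As a quick-start illustration, the sketch below strings the pieces together: load credentials with python-dotenv, attach a document loader and an LLM, and extract into a contract. The method names (load_document_loader, load_llm, extract) and the Tesseract loader follow the library's usual pattern but are assumptions here; consult the examples directory for the authoritative version.

```python
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, Contract, DocumentLoaderTesseract  # assumed exports

load_dotenv()  # expects OPENAI_API_KEY (and, for OCR, TESSERACT_PATH) in a .env file

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

extractor = Extractor()
# Tesseract handles scanned/image documents; a PyPdf loader would suit digital PDFs.
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-4o-mini")  # model string is illustrative; any supported provider works

# Replace the README's placeholder path with a real file before running.
result = extractor.extract("path_to_your_document.pdf", InvoiceContract)
print(result.invoice_number, result.total_amount)
```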

Highlighted Details

  • Supports multiple document loaders: Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, PyPdf.
  • Enables custom extraction contracts via Pydantic models (see the sketch after this list).
  • Offers asynchronous processing and flexible splitting strategies (lazy/eager).
  • Integrates with various LLM providers (OpenAI, Anthropic, Cohere, Azure OpenAI) and local models (Ollama).
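
The custom-contract support noted above is where the ORM-style feel shows up: because a Contract is just a Pydantic model, nested structures and lists validate automatically. A hypothetical nested contract (all names illustrative):

```python
from typing import List, Optional
from extract_thinker import Contract  # assumed top-level export

# Hypothetical nested contract: sub-models describe repeated sections.
class WorkExperience(Contract):
    company: str
    role: str
    start_year: int
    end_year: Optional[int] = None  # None when the role is current

class ResumeContract(Contract):
    full_name: str
    email: str
    skills: List[str]
    experience: List[WorkExperience]
```

The same contract can then be reused unchanged across document loaders and LLM providers, including local models served via Ollama.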

Maintenance & Community

  • Community-driven development, inspired by LangChain.
  • Resources include examples, Medium articles, and a test suite.
  • Community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README's examples use placeholder paths (e.g., "path_to_your_files", "path_to_your_document.pdf") that must be replaced with real locations. Some advanced features, such as OCR-based loaders or specific LLM integrations, require additional setup or dependencies beyond the basic installation.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2

Star History

  • 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Paul Copplestone (Cofounder of Supabase), and 2 more.

MegaParse by QuivrHQ

0.5%
7k
File parser optimized for LLM ingestion
created 1 year ago
updated 5 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

1.0%
19k
RAG framework for fast, simple retrieval-augmented generation
created 10 months ago
updated 21 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Tim J. Baek (Founder of Open WebUI), and 2 more.

llmware by llmware-ai

0.2%
14k
Framework for enterprise RAG pipelines using small, specialized models
created 1 year ago
updated 1 week ago