ExtractThinker by enoch3712

Document intelligence library for LLMs, ORM-style interaction

Created 1 year ago
1,404 stars

Top 28.9% on SourcePulse

Project Summary

ExtractThinker is a Python library designed for document intelligence, enabling users to extract and classify structured data from various document formats using Large Language Models (LLMs). It targets developers and researchers working with document processing workflows, offering an ORM-like interface for simplified interaction with LLMs and document data.

How It Works

ExtractThinker employs a modular architecture, inspired by LangChain, comprising Document Loaders for data ingestion, Extractors for LLM orchestration, Splitters for chunking, Contracts (Pydantic models) for defining output schemas, Classifications for routing, and Processes for managing workflows. This design facilitates specialized document processing, aiming for higher accuracy and ease of use compared to general-purpose frameworks.
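
The division of labour described above can be illustrated with a small standard-library sketch. It is not ExtractThinker's actual API: `TextLoader`, `InvoiceContract`, `SketchExtractor`, and `fake_llm` are hypothetical names, and the LLM call is replaced by a stub returning canned JSON so the example runs offline. The library's contracts are Pydantic models; a dataclass stands in for one here.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, Type, TypeVar

T = TypeVar("T")


@dataclass
class InvoiceContract:
    """Hypothetical output schema (the library uses Pydantic models)."""
    invoice_number: str
    total: float


class TextLoader:
    """Hypothetical loader: turns a raw source into plain text."""

    def load(self, source: str) -> str:
        return source.strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same canned JSON."""
    return '{"invoice_number": "INV-001", "total": "129.50"}'


class SketchExtractor:
    """Hypothetical extractor: loader -> LLM -> contract validation."""

    def __init__(self, loader: TextLoader, llm: Callable[[str], str]) -> None:
        self.loader = loader
        self.llm = llm

    def extract(self, source: str, contract: Type[T]) -> T:
        text = self.loader.load(source)
        names = [f.name for f in fields(contract)]
        prompt = f"Return JSON with keys {names} for this document:\n{text}"
        raw = json.loads(self.llm(prompt))
        # Coerce each value to the type declared on the contract;
        # a missing key raises KeyError, an unparseable value raises ValueError.
        values = {f.name: f.type(raw[f.name]) for f in fields(contract)}
        return contract(**values)


extractor = SketchExtractor(TextLoader(), fake_llm)
invoice = extractor.extract("Invoice INV-001, total due 129.50", InvoiceContract)
print(invoice)
```

The point of the contract is that the caller receives a typed object instead of free-form model output, so malformed responses fail at the extraction boundary and not further downstream.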

Quick Start & Requirements

  • Primary install: pip install extract_thinker
  • Prerequisites: Python 3.x and dotenv for loading environment variables. Most functionality requires an LLM API key (e.g., OpenAI). Image processing may require the path to a Tesseract OCR installation.
  • Resources: LLM API usage incurs costs; local models may need a GPU.
  • Documentation: the examples directory in the repository.
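
Before running any extraction, it can help to check that the prerequisites listed above are in place. The following is a minimal sketch using only the standard library. The function name `preflight` is hypothetical, and `OPENAI_API_KEY` is assumed as the variable name for the OpenAI case; other providers expect different variables.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which optional prerequisites look available."""
    env = os.environ if env is None else env
    return {
        # Assumed variable name; other providers use different ones.
        "openai_api_key": bool(env.get("OPENAI_API_KEY")),
        # True when a `tesseract` executable is found on PATH.
        "tesseract_on_path": shutil.which("tesseract") is not None,
    }


for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Passing the environment as an argument keeps the check easy to exercise with a plain dictionary, without touching the real process environment.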

Highlighted Details

  • Supports multiple document loaders: Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, PyPdf.
  • Enables custom extraction contracts via Pydantic models.
  • Offers asynchronous processing and flexible splitting strategies (lazy/eager).
  • Integrates with various LLM providers (OpenAI, Anthropic, Cohere, Azure OpenAI) and local models (Ollama).
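
As a general illustration of the lazy/eager distinction and of asynchronous processing mentioned above, the sketch below contrasts building all page groups up front with yielding them on demand, then processes the groups concurrently. It does not reproduce the library's splitting logic: `split_eager`, `split_lazy`, and `process_group` are hypothetical names, and the LLM call is replaced by a trivial coroutine.

```python
import asyncio
from typing import Iterator, List


def split_eager(pages: List[str], size: int) -> List[List[str]]:
    """Build every group of pages up front."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def split_lazy(pages: List[str], size: int) -> Iterator[List[str]]:
    """Yield one group at a time, so later groups are built only on demand."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]


async def process_group(group: List[str]) -> int:
    """Stand-in for an asynchronous LLM call; returns a character count."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return sum(len(page) for page in group)


async def process_all(pages: List[str], size: int) -> List[int]:
    # gather() runs the per-group coroutines concurrently and keeps input order.
    return await asyncio.gather(
        *(process_group(group) for group in split_lazy(pages, size))
    )


pages = ["page one", "page two", "page three"]
eager_groups = split_eager(pages, 2)
lazy_groups = list(split_lazy(pages, 2))
totals = asyncio.run(process_all(pages, 2))
print(eager_groups == lazy_groups, totals)
```

Both strategies produce the same groups here; the difference is when the work happens, which matters for memory use and latency on long documents.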

Maintenance & Community

  • Community-driven development, inspired by LangChain.
  • Resources include examples, Medium articles, and a test suite.
  • Community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README's examples use placeholder paths such as "path_to_your_files" and "path_to_your_document.pdf", which users must replace with real paths. Some advanced features, such as OCR or specific LLM integrations, may require additional setup or dependencies that the basic installation does not cover.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 1
  • Star history: 35 stars in the last 30 days
