ExtractThinker by enoch3712

Document intelligence library for LLMs, ORM-style interaction

created 1 year ago
1,313 stars

Top 31.1% on sourcepulse

1 Expert Loves This Project
Project Summary

ExtractThinker is a Python library designed for document intelligence, enabling users to extract and classify structured data from various document formats using Large Language Models (LLMs). It targets developers and researchers working with document processing workflows, offering an ORM-like interface for simplified interaction with LLMs and document data.

How It Works

ExtractThinker employs a modular architecture, inspired by LangChain, comprising Document Loaders for data ingestion, Extractors for LLM orchestration, Splitters for chunking, Contracts (Pydantic models) for defining output schemas, Classifications for routing, and Processes for managing workflows. This design facilitates specialized document processing, aiming for higher accuracy and ease of use compared to general-purpose frameworks.
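
To make those components concrete, here is a minimal sketch of a Contract and a pair of Classifications. The class names follow the terminology above, but the field and constructor names are drawn from typical usage and should be treated as assumptions rather than a verified API reference.

```python
from extract_thinker import Contract, Classification  # assumed top-level exports

# Contracts are Pydantic models that define the structured output schema.
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

class DriverLicenseContract(Contract):
    name: str
    license_number: str
    expiration_date: str

# Classifications pair a label and description with the contract to apply
# when a document is routed to that type (argument names are assumptions).
classifications = [
    Classification(
        name="Invoice",
        description="A billing document with line items and totals",
        contract=InvoiceContract,
    ),
    Classification(
        name="Driver License",
        description="A government-issued driver license",
        contract=DriverLicenseContract,
    ),
]
```

Splitters and Processes then operate over these pieces: a Splitter chunks multi-document files, and a Process wires loaders, classifications, and extraction into a single workflow.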

Quick Start & Requirements

  • Primary install: pip install extract_thinker
  • Prerequisites: Python 3.x and python-dotenv for loading environment variables. LLM API keys (e.g., OpenAI) are required for most functionality. A Tesseract OCR installation (and its path) may be needed for image-based documents.
  • Resources: LLM API costs, potential GPU for local models.
  • Documentation: examples directory in the repository (a minimal usage sketch follows this list).
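
As a quick-start illustration, the sketch below strings the pieces together: load credentials with python-dotenv, attach a document loader and an LLM, and extract into a contract. The method names (load_document_loader, load_llm, extract) and the Tesseract loader follow the library's usual pattern but are assumptions here; consult the examples directory for the authoritative version.

```python
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, Contract, DocumentLoaderTesseract  # assumed exports

load_dotenv()  # expects OPENAI_API_KEY (and, for OCR, TESSERACT_PATH) in a .env file

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

extractor = Extractor()
# Tesseract handles scanned/image documents; a PyPdf loader would suit digital PDFs.
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-4o-mini")  # model string is illustrative; any supported provider works

# Replace the README's placeholder path with a real file before running.
result = extractor.extract("path_to_your_document.pdf", InvoiceContract)
print(result.invoice_number, result.total_amount)
```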

Highlighted Details

  • Supports multiple document loaders: Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, PyPdf.
  • Enables custom extraction contracts via Pydantic models (see the sketch after this list).
  • Offers asynchronous processing and flexible splitting strategies (lazy/eager).
  • Integrates with various LLM providers (OpenAI, Anthropic, Cohere, Azure OpenAI) and local models (Ollama).
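
The custom-contract support noted above is where the ORM-style feel shows up: because a Contract is just a Pydantic model, nested structures and lists validate automatically. A hypothetical nested contract (all names illustrative):

```python
from typing import List, Optional
from extract_thinker import Contract  # assumed top-level export

# Hypothetical nested contract: sub-models describe repeated sections.
class WorkExperience(Contract):
    company: str
    role: str
    start_year: int
    end_year: Optional[int] = None  # None when the role is current

class ResumeContract(Contract):
    full_name: str
    email: str
    skills: List[str]
    experience: List[WorkExperience]
```

The same contract can then be reused unchanged across document loaders and LLM providers, including local models served via Ollama.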

Maintenance & Community

  • Community-driven development, inspired by LangChain.
  • Resources include examples, Medium articles, and a test suite.
  • Community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README's examples use placeholder paths (e.g., "path_to_your_files", "path_to_your_document.pdf") that must be replaced with real locations. Some advanced features, such as OCR-based loaders or specific LLM integrations, require additional setup or dependencies beyond the basic installation.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2

Star History

  • 97 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Paul Copplestone (Cofounder of Supabase), and 2 more.

MegaParse by QuivrHQ

0.5%
7k
File parser optimized for LLM ingestion
created 1 year ago
updated 5 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

1.0%
19k
RAG framework for fast, simple retrieval-augmented generation
created 10 months ago
updated 21 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Tim J. Baek (Founder of Open WebUI), and 2 more.

llmware by llmware-ai

0.2%
14k
Framework for enterprise RAG pipelines using small, specialized models
created 1 year ago
updated 1 week ago