docuglean-ocr  by cernis-intelligence

Intelligent document processing SDK for AI-powered data extraction

Created 4 months ago
529 stars

Top 59.7% on SourcePulse

Project Summary

Docuglean is a unified SDK for intelligent document processing that extracts structured data (JSON, Markdown, HTML) from documents using state-of-the-art AI models. It targets engineers and power users who need to automate document analysis, and it offers multilingual and multimodal capabilities through plug-and-play APIs for OCR, data extraction, classification, summarization, and translation. The SDK aims to simplify complex document workflows with easy-to-use interfaces and broad AI provider support.

How It Works

Docuglean provides a unified SDK with plug-and-play APIs for various document processing tasks. It works with multiple AI providers, including OpenAI, Mistral, Google Gemini, and Hugging Face, and accepts multimodal inputs (PDFs and images). A key advantage is type-safe structured data extraction: output is validated against Zod (TypeScript) or Pydantic (Python) schemas, which helps ensure data integrity. The SDK also includes built-in local parsers for common formats (DOCX, PPTX, XLSX, CSV, TSV, and PDF), reducing external dependencies for basic parsing.
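
To illustrate what schema-validated extraction buys you, here is a minimal sketch that checks a model's JSON reply against a typed schema before the rest of the program touches it. It does not use Docuglean's API. The `Invoice` fields and the `parse_invoice` helper are hypothetical, and a standard-library dataclass stands in for a Pydantic model so the example has no dependencies.

```python
import json
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class Invoice:
    # Hypothetical schema; the field names are illustrative only.
    vendor: str
    total: float
    currency: str


def parse_invoice(raw_json: str) -> Invoice:
    """Validate a JSON reply against the Invoice schema or raise ValueError."""
    data = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    values = {}
    for name, expected in get_type_hints(Invoice).items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # JSON has one number type, so accept an int where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
        values[name] = value
    return Invoice(**values)
```

A reply with a missing or mistyped field fails at this boundary with a clear error instead of propagating bad data downstream. Pydantic and Zod provide the same guarantee with far less hand-written checking.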

Quick Start & Requirements

  • Primary Install:
    • Node.js/TypeScript: npm install docuglean-ocr
    • Python: pip install docuglean
  • Prerequisites: API keys are required for AI providers (OpenAI, Mistral, Google Gemini, Hugging Face). Local parsers for DOCX, PPTX, XLSX, CSV, TSV, and PDF do not require an API key.
  • Links: Code examples for Quick Start are provided within the README.
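
The prerequisite split above (API keys for AI providers, none for local parsers) can be sketched as a small pre-flight check. This is not part of Docuglean. The helper names are hypothetical, and the environment variable names are assumptions that should be checked against each provider's documentation.

```python
import os
from pathlib import Path

# Formats the summary lists as handled by the built-in local parsers.
LOCAL_FORMATS = {".docx", ".pptx", ".xlsx", ".csv", ".tsv", ".pdf"}

# Assumed environment variable names, not confirmed by the source.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "huggingface": "HF_TOKEN",
}


def can_parse_locally(path: str) -> bool:
    """Return True if the file extension is one of the local-parser formats."""
    return Path(path).suffix.lower() in LOCAL_FORMATS


def resolve_api_key(provider: str, env=None) -> str:
    """Look up the provider's key, failing early with a readable message."""
    env = os.environ if env is None else env
    var = PROVIDER_ENV_VARS[provider]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"set {var} before calling the {provider} provider")
    return key
```

Checking for the key before any document is read surfaces configuration problems at startup instead of partway through a job.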

Highlighted Details

  • Easy-to-use API with detailed documentation and type hints.
  • OCR capabilities for extracting text from images and scanned documents.
  • Structured data extraction via Zod/Pydantic schemas for type-safe output.
  • Document classification for intelligently splitting multi-section documents.
  • Multimodal support for processing PDFs and images.
  • Support for multiple AI providers and models.
  • Batch processing for concurrent document handling with automatic error handling.
  • Built-in local parsers for DOCX, PPTX, XLSX, CSV, TSV, and PDF.
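
As a rough picture of what concurrent batch processing with per-document error handling involves, the sketch below runs a worker over many files with bounded concurrency and records failures instead of aborting the batch. It is a generic `asyncio` pattern, not Docuglean's implementation. `process_batch`, `worker`, and `max_concurrency` are hypothetical names.

```python
import asyncio


async def process_batch(paths, worker, max_concurrency=4):
    """Run `worker(path)` for every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path):
        async with semaphore:
            try:
                return {"path": path, "ok": True, "result": await worker(path)}
            except Exception as exc:
                # One failing document is recorded; the others keep running.
                return {"path": path, "ok": False, "error": str(exc)}

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(run_one(p) for p in paths))
```

Here `worker` is any coroutine function that takes a path, for example a wrapper around an OCR or extraction call. The batch is run with `asyncio.run(process_batch(paths, worker))`, and callers can filter the returned list on the `ok` flag to retry or report failures.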

Maintenance & Community

The README gives no details on maintainers, community channels (such as Discord or Slack), or a public roadmap. Its "Coming Soon" section indicates ongoing development.

Licensing & Compatibility

  • License Type: Apache 2.0.
  • Compatibility: The permissive license allows commercial use. PDF processing relies on pdftext (Apache/BSD) rather than AGPL-licensed alternatives such as PyMuPDF.

Limitations & Caveats

Future enhancements are planned, including integration with more AI models and providers (e.g., Llama, Together AI, OpenRouter) and expanded multilingual support. Running the provided examples requires obtaining and configuring API keys for the chosen AI providers.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 278 stars in the last 30 days
