docstrange by NanoNets

Extract and convert data from any document to multiple formats

Created 4 months ago

1,058 stars

Top 35.6% on SourcePulse

Project Summary

DocStrange is a Python library designed to extract and convert data from various document types, including PDFs, Word docs, images, and URLs, into multiple formats like Markdown, JSON, and CSV. It targets developers and users needing to process documents for AI, data analysis, or general information retrieval, offering both cloud-based and local processing options.

How It Works

DocStrange combines OCR with AI-model-based content extraction. It accepts a wide range of input formats and can output clean Markdown, structured JSON (optionally limited to specific fields or shaped by a custom schema), HTML, or CSV. The library prioritizes LLM-friendly output and accurate table extraction, and it automatically falls back to another OCR engine when one fails.
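The automatic OCR fallback can be pictured with a generic pattern: try engines in order and keep the first usable result. The sketch below is an illustration only, not DocStrange's implementation; the engine callables, their names, and the "empty output counts as failure" rule are assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical engine type: takes a file path, returns extracted text.
OcrEngine = Callable[[str], str]


def extract_with_fallback(
    path: str, engines: Sequence[tuple[str, OcrEngine]]
) -> tuple[str, str]:
    """Try each (name, engine) pair in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, engine in engines:
        try:
            text = engine(path)
        except Exception as exc:  # an engine may fail for many unrelated reasons
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # assumption: empty output is treated as a failure
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))


# Stand-in engines for demonstration; real engines would run OCR on the file.
def broken_engine(path: str) -> str:
    raise OSError("model not loaded")


def simple_engine(path: str) -> str:
    return f"text from {path}"


name, text = extract_with_fallback(
    "scan.png", [("primary", broken_engine), ("backup", simple_engine)]
)
print(name, "->", text)
```

Collecting the per-engine errors before raising keeps the failure message useful when every engine fails.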

Quick Start & Requirements

Install: pip install docstrange
Local LLM Processing: Install with pip install 'docstrange[local-llm]'. Ollama must also be installed and running (ollama serve), with a model pulled (e.g., ollama pull llama3.2).
Online Demo: docstrange.nanonets.com
Documentation: Available via the library's usage examples and CLI.
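Before switching to local LLM processing, it may help to confirm that Ollama is reachable and that the model has been pulled. The sketch below queries Ollama's /api/tags endpoint on its default address (http://localhost:11434); the helper names are hypothetical, and the address is an assumption that will differ if Ollama is configured otherwise.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs


def model_in_tags(payload: dict, model: str) -> bool:
    """Return True if `model` (with or without a tag such as ':latest') is listed."""
    names = [m.get("name", "") for m in payload.get("models", [])]
    return any(n == model or n.split(":")[0] == model for n in names)


def ollama_has_model(model: str, base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the server answers and lists `model`; False if unreachable or missing."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):  # connection refused, timeout, or a non-JSON reply
        return False
    return isinstance(payload, dict) and model_in_tags(payload, model)


if __name__ == "__main__":
    if ollama_has_model("llama3.2"):
        print("Ollama is running and llama3.2 is available.")
    else:
        print("Start Ollama (ollama serve) and pull the model (ollama pull llama3.2).")
```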

Highlighted Details

Dual Processing Modes: Offers free, rate-limited cloud processing and local CPU/GPU processing for privacy.
Flexible Extraction: Supports extracting all data, specific fields, or data conforming to a JSON schema.
Universal Input/Output: Handles PDFs, DOCX, XLSX, PPTX, images, URLs, and outputs to Markdown, JSON, CSV, HTML, and plain text.
CLI & Python API: Provides both a command-line interface and a Python library for integration.
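The difference between "specific fields" and "data conforming to a JSON schema" can be made concrete with a library-independent sketch. The code below does not call DocStrange; it operates on a hypothetical, already extracted record, and the checker covers only required keys and top-level types, not the full JSON Schema specification.

```python
from typing import Any

# Hypothetical extracted record; not real DocStrange output.
record = {"invoice_number": "INV-001", "total_amount": 1250.5, "vendor": "Acme"}

# Hypothetical schema using a small subset of JSON Schema keywords.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number"],
}

# Python types for the JSON Schema "type" names handled by this sketch.
_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict, "array": list}


def select_fields(data: dict[str, Any], fields: list[str]) -> dict[str, Any]:
    """Keep only the requested keys; missing keys become None for a predictable shape."""
    return {field: data.get(field) for field in fields}


def conforms(data: dict[str, Any], spec: dict[str, Any]) -> bool:
    """Check required keys and top-level property types only."""
    if any(key not in data for key in spec.get("required", [])):
        return False
    for key, prop in spec.get("properties", {}).items():
        if key not in data:
            continue
        value, kind = data[key], prop.get("type")
        if kind == "number" and isinstance(value, bool):
            return False  # bool is a subclass of int in Python; reject it as a number
        expected = _TYPES.get(kind)
        if expected is not None and not isinstance(value, expected):
            return False
    return True


print(select_fields(record, ["invoice_number", "due_date"]))
print(conforms(record, schema))
```

For real validation, a dedicated JSON Schema validator is a better fit than this hand-rolled check.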

Maintenance & Community

The project is maintained by Nanonets. Support and discussions are available via email (support@nanonets.com), GitHub Issues, and GitHub Discussions.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Local LLM processing requires additional setup with Ollama. The MCP server for Claude Desktop integration is not included in the PyPI package and requires cloning the repository. Cloud processing has rate limits, with an authenticated tier offering 10,000 documents/month.
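The 10,000 documents/month figure translates into simple batch-planning arithmetic. The sketch below assumes a 30-day month and an even daily spread, neither of which is stated by the project; the function names are hypothetical.

```python
import math

MONTHLY_QUOTA = 10_000  # authenticated cloud tier, documents per month (from the text above)


def months_needed(total_docs: int, quota: int = MONTHLY_QUOTA) -> int:
    """Number of monthly quota periods needed to process `total_docs` via the cloud."""
    return math.ceil(total_docs / quota)


def daily_budget(quota: int = MONTHLY_QUOTA, days: int = 30) -> int:
    """Documents per day if the quota is spread evenly (assumes a 30-day month)."""
    return quota // days


print(months_needed(25_000))  # 25,000 docs span 3 quota periods
print(daily_budget())         # 333 documents per day
```

A backlog that exceeds this budget is one case where the local processing mode may be worth the extra setup.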

Explore Similar Projects

documind by DocumindHQ

thepipe by emcf

knowledge-table by whyhow-ai

docext by NanoNets

text-extract-api by CatchTheTornado

nv-ingest by NVIDIA

trafilatura by adbar

sparrow by katanaml

omniparse by adithya-s-k

zerox by getomni-ai

langextract by google

MinerU by opendatalab