docstrange by NanoNets

Extract and convert data from any document to multiple formats

created 2 weeks ago

308 stars

Top 86.9% on SourcePulse

Project Summary

DocStrange is a Python library designed to extract and convert data from various document types, including PDFs, Word docs, images, and URLs, into multiple formats like Markdown, JSON, and CSV. It targets developers and users needing to process documents for AI, data analysis, or general information retrieval, offering both cloud-based and local processing options.

How It Works

DocStrange combines OCR with AI-model-based content extraction. It accepts a wide range of input formats and can output clean Markdown, structured JSON (with options for specific fields or custom schemas), HTML, or CSV. The library prioritizes LLM-friendly output and accurate table extraction, and it falls back automatically between OCR engines.
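The fallback between OCR engines mentioned above can be pictured with a minimal sketch. This is not DocStrange's implementation; the engine functions, the `run_with_fallback` helper, and the "empty text counts as failure" rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical engine type: takes image bytes, returns recognized text or raises.
OcrEngine = Callable[[bytes], str]


class AllEnginesFailed(RuntimeError):
    """Raised when every engine in the chain errors out or returns no text."""


def run_with_fallback(
    image: bytes, engines: Sequence[Tuple[str, OcrEngine]]
) -> Tuple[str, str]:
    """Try each engine in order and return (engine_name, text) from the
    first one that produces non-empty text."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # any engine failure triggers the fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise AllEnginesFailed("; ".join(errors))


# Stand-in engines, for illustration only (not real OCR backends).
def primary_engine(image: bytes) -> str:
    raise RuntimeError("model not available")


def backup_engine(image: bytes) -> str:
    return "Invoice #123"


if __name__ == "__main__":
    chain = [("primary", primary_engine), ("backup", backup_engine)]
    print(run_with_fallback(b"", chain))
```

Ordering the chain from most to least preferred keeps the policy in one place: the caller only sees which engine finally answered.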

Quick Start & Requirements

  • Install: pip install docstrange
  • Local LLM Processing: Install the extra with pip install 'docstrange[local-llm]', have Ollama installed and running (ollama serve), and pull a model (e.g., ollama pull llama3.2).
  • Online Demo: docstrange.nanonets.com
  • Documentation: Available via the library's usage examples and CLI.

Highlighted Details

  • Dual Processing Modes: Offers free, rate-limited cloud processing and local CPU/GPU processing for privacy.
  • Flexible Extraction: Supports extracting all data, specific fields, or data conforming to a JSON schema.
  • Universal Input/Output: Handles PDFs, DOCX, XLSX, PPTX, images, and URLs; outputs Markdown, JSON, CSV, HTML, and plain text.
  • CLI & Python API: Provides both a command-line interface and a Python library for integration.
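To make the "Flexible Extraction" item above concrete, here is a small sketch of what field selection and schema conformance mean for an already-extracted record. It does not call DocStrange; `select_fields`, `schema_errors`, and the sample record are hypothetical, and the check covers only a small subset of JSON Schema (`required` and per-property `type`).

```python
from typing import Any, Dict, List

# Mapping from JSON Schema type names to Python types (subset only).
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def select_fields(record: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Keep only the requested fields; fields absent from the record become None."""
    return {name: record.get(name) for name in fields}


def schema_errors(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found when checking a flat record against
    the 'required' and 'properties.<name>.type' parts of a schema."""
    errors: List[str] = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in record or "type" not in spec:
            continue
        value = record[name]
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
    return errors


if __name__ == "__main__":
    record = {"invoice_number": "INV-1", "total": 42.5, "paid": True}
    schema = {
        "required": ["invoice_number", "total"],
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
        },
    }
    print(select_fields(record, ["invoice_number", "due_date"]))
    print(schema_errors(record, schema))
```

For real validation, a full JSON Schema validator is the better choice; the point here is only the difference between picking named fields and checking a record against a declared shape.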

Maintenance & Community

The project is maintained by Nanonets. Support and discussions are available via email (support@nanonets.com), GitHub Issues, and GitHub Discussions.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Local LLM processing requires additional setup with Ollama. The MCP server for Claude Desktop integration is not included in the PyPI package and requires cloning the repository. Cloud processing has rate limits, with an authenticated tier offering 10,000 documents/month.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 4

Star History

312 stars in the last 17 days
