Extract and convert data from any document to multiple formats
DocStrange is a Python library designed to extract and convert data from various document types, including PDFs, Word docs, images, and URLs, into multiple formats like Markdown, JSON, and CSV. It targets developers and users needing to process documents for AI, data analysis, or general information retrieval, offering both cloud-based and local processing options.
How It Works
DocStrange combines OCR with AI-model-based content extraction. It accepts a wide range of input formats and can output clean Markdown, structured JSON (optionally restricted to specific fields or shaped by a custom schema), HTML, or CSV. The library prioritizes LLM-friendly output and accurate table extraction, and it falls back between OCR engines automatically.
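The output options described above can be sketched as a small wrapper. This is a minimal sketch, not the library's documented interface: the names DocumentExtractor, extract, extract_markdown, extract_html, extract_csv, and extract_data (with its specified_fields and json_schema parameters) are recalled from the project README and should be treated as assumptions to verify against the installed version. The convert function and its argument checks are hypothetical helpers introduced here for illustration.

```python
SUPPORTED_FORMATS = ("markdown", "json", "html", "csv")


def convert(source, fmt="markdown", fields=None, schema=None):
    """Convert one document and return the requested output (hypothetical helper)."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if fmt != "json" and (fields is not None or schema is not None):
        raise ValueError("fields and schema only apply to JSON output")
    if fields is not None and schema is not None:
        raise ValueError("pass either fields or schema, not both")

    # Imported lazily so the argument checks above work without the package.
    # ASSUMPTION: class and method names below follow the project README.
    from docstrange import DocumentExtractor

    extractor = DocumentExtractor()
    result = extractor.extract(source)  # file path or URL

    if fmt == "markdown":
        return result.extract_markdown()
    if fmt == "html":
        return result.extract_html()
    if fmt == "csv":
        return result.extract_csv()

    # JSON: everything, selected fields only, or data shaped by a schema.
    if fields is not None:
        return result.extract_data(specified_fields=list(fields))
    if schema is not None:
        return result.extract_data(json_schema=schema)
    return result.extract_data()
```

A call such as convert("invoice.pdf", fmt="json", fields=["invoice_number", "total_amount"]) would request only those two fields, assuming the field-selection parameter behaves as described; the file name and field names are illustrative.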
Quick Start & Requirements
Install: pip install docstrange
Local LLM processing additionally requires pip install 'docstrange[local-llm]', Ollama installed and running (ollama serve), and a model pulled (e.g., ollama pull llama3.2).
Highlighted Details
Maintenance & Community
The project is maintained by Nanonets. Support and discussions are available via email (support@nanonets.com), GitHub Issues, and GitHub Discussions.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Local LLM processing requires additional setup with Ollama. The MCP server for Claude Desktop integration is not included in the PyPI package and requires cloning the repository. Cloud processing has rate limits, with an authenticated tier offering 10,000 documents/month.
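Because local LLM processing depends on Ollama being set up, a short preflight check can catch a missing prerequisite before any document is processed. The sketch below uses only the Python standard library and does not call DocStrange. It assumes Ollama is listening on its default local address (port 11434) and uses Ollama's model-listing endpoint (/api/tags); check_local_setup and fetch_installed_models are hypothetical helper names, not part of either project.

```python
import json
import shutil
import urllib.request

# ASSUMPTION: Ollama is running on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_installed_models(url=OLLAMA_TAGS_URL, timeout=2.0):
    """Return names of locally pulled models; raises OSError if unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def check_local_setup(model="llama3.2", which=shutil.which,
                      fetch=fetch_installed_models):
    """Return a list of problems; an empty list means the checks passed."""
    if which("ollama") is None:
        return ["ollama binary not found on PATH"]
    try:
        names = fetch()
    except OSError:  # covers connection refused, timeouts, and URLError
        return ["Ollama server not reachable; start it with `ollama serve`"]
    # Model names come back tagged, e.g. "llama3.2:latest".
    if not any(n == model or n.split(":")[0] == model for n in names):
        return [f"model not pulled; run `ollama pull {model}`"]
    return []


if __name__ == "__main__":
    for problem in check_local_setup():
        print("setup issue:", problem)
```

The which and fetch parameters exist only so the checks can be exercised without a real Ollama installation; in normal use the defaults are enough.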