docstrange by NanoNets

Extract and convert data from any document to multiple formats

Created 2 months ago
776 stars

Top 45.1% on SourcePulse

Project Summary

DocStrange is a Python library designed to extract and convert data from various document types, including PDFs, Word docs, images, and URLs, into multiple formats like Markdown, JSON, and CSV. It targets developers and users needing to process documents for AI, data analysis, or general information retrieval, offering both cloud-based and local processing options.

How It Works

DocStrange combines OCR with AI-model-based content extraction. It accepts a wide range of input formats and can output clean Markdown, structured JSON (all data, specific fields, or a custom schema), HTML, or CSV. The library prioritizes LLM-friendly output and accurate table extraction, with automatic fallback between OCR engines.
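
The OCR fallback idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DocStrange's actual implementation: the `ocr_with_fallback` helper, the `OcrEngine` signature, and both stand-in engines are hypothetical names invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical engine signature: takes image bytes, returns recognized text.
OcrEngine = Callable[[bytes], str]


def ocr_with_fallback(
    image: bytes, engines: Sequence[Tuple[str, OcrEngine]]
) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, text) from the first success."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))


# Stand-in engines, for illustration only.
def broken_engine(image: bytes) -> str:
    raise RuntimeError("model not loaded")


def simple_engine(image: bytes) -> str:
    return image.decode("utf-8", errors="ignore")


used, text = ocr_with_fallback(
    b"Invoice #42", [("primary", broken_engine), ("backup", simple_engine)]
)
print(used, text)
```

Ordering the engines from most to least preferred keeps the policy in one list, so changing the fallback order does not require touching the loop.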

Quick Start & Requirements

  • Install: pip install docstrange
  • Local LLM Processing: Requires pip install 'docstrange[local-llm]', Ollama installed and running (ollama serve), and a model pulled (e.g., ollama pull llama3.2).
  • Online Demo: docstrange.nanonets.com
  • Documentation: Available via the library's usage examples and CLI.
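
Before trying local LLM processing, it can help to confirm that the prerequisites listed above are present. The sketch below uses only the Python standard library; the `check_local_prereqs` helper is a hypothetical name, and the check is deliberately shallow: it does not verify that `ollama serve` is running or that a model has been pulled.

```python
import importlib.util
import shutil


def check_local_prereqs() -> dict:
    """Report whether the package and the Ollama binary can be found."""
    return {
        # True if a top-level module named "docstrange" is importable.
        "docstrange_installed": importlib.util.find_spec("docstrange") is not None,
        # True if an "ollama" executable is on PATH.
        "ollama_on_path": shutil.which("ollama") is not None,
    }


status = check_local_prereqs()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```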

Highlighted Details

  • Dual Processing Modes: Offers free, rate-limited cloud processing and local CPU/GPU processing for privacy.
  • Flexible Extraction: Supports extracting all data, specific fields, or data conforming to a JSON schema.
  • Universal Input/Output: Handles PDFs, DOCX, XLSX, PPTX, images, URLs, and outputs to Markdown, JSON, CSV, HTML, and plain text.
  • CLI & Python API: Provides both a command-line interface and a Python library for integration.
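
To make the difference between the extraction modes concrete, here is a small standard-library sketch. It does not call DocStrange and does not reproduce its API; the sample record, `pick_fields`, and `conform_to_schema` are all made up for illustration, and the schema handling covers only a few simple JSON Schema types.

```python
from typing import Any, Dict, List

# Made-up record standing in for an "extract all data" result.
full: Dict[str, Any] = {
    "invoice_number": "INV-001",
    "total_amount": "120.50",
    "vendor_name": "Acme",
    "notes": "net 30",
}


def pick_fields(data: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Specific-field mode: keep only requested keys; missing ones become None."""
    return {name: data.get(name) for name in fields}


def conform_to_schema(data: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Schema mode: keep the schema's properties and coerce simple types."""
    casts = {"string": str, "number": float, "integer": int}
    out: Dict[str, Any] = {}
    for name, spec in schema.get("properties", {}).items():
        value = data.get(name)
        cast = casts.get(spec.get("type"))
        out[name] = cast(value) if value is not None and cast else value
    return out


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

print(pick_fields(full, ["vendor_name", "due_date"]))
print(conform_to_schema(full, schema))
```

Field selection returns a flat subset, while the schema path also fixes types, which is why a schema tends to be the better fit when the output feeds a typed downstream system.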

Maintenance & Community

The project is maintained by Nanonets. Support and discussions are available via email (support@nanonets.com), GitHub Issues, and GitHub Discussions.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Local LLM processing requires additional setup with Ollama. The MCP server for Claude Desktop integration is not included in the PyPI package and requires cloning the repository. Cloud processing has rate limits, with an authenticated tier offering 10,000 documents/month.
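
One way to work within the cloud quota is to count documents and switch to local processing once the monthly allowance is used up. The following is a rough sketch under stated assumptions: `ModeRouter` is a hypothetical helper, the counter is kept in memory only, and resetting it at the start of each month is left out.

```python
from dataclasses import dataclass

# Authenticated-tier figure quoted above.
MONTHLY_CLOUD_QUOTA = 10_000


@dataclass
class ModeRouter:
    """Pick a processing mode per document based on remaining cloud quota."""

    quota: int = MONTHLY_CLOUD_QUOTA
    used: int = 0

    def next_mode(self) -> str:
        # Use the cloud while quota remains, then fall back to local processing.
        if self.used < self.quota:
            self.used += 1
            return "cloud"
        return "local"


# Tiny quota so the switch is visible in a short run.
router = ModeRouter(quota=2)
modes = [router.next_mode() for _ in range(3)]
print(modes)
```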

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 233 stars in the last 30 days

Explore Similar Projects

Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

trafilatura by adbar

0.6%
5k
Python package for web text extraction
Created 6 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jerry Liu (Cofounder of LlamaIndex), and 1 more.

sparrow by katanaml

0.1%
5k
Data processing & instruction calling tool using ML, LLM, and Vision LLM
Created 3 years ago
Updated 1 day ago
Starred by Travis Fischer (Founder of Agentic), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

MinerU by opendatalab

1.7%
46k
PDF extraction tool for converting PDFs to Markdown and JSON
Created 1 year ago
Updated 1 day ago