pdf-ocr-obsidian  by diegomarzaa

PDF to Markdown conversion with OCR and image linking

Created 6 months ago
253 stars

Top 99.4% on SourcePulse

Project Summary

This project automates the conversion of PDF documents into structured Markdown files tailored for Obsidian. It uses the Mistral AI OCR API to extract text and images, preserves the document hierarchy, and links images with Obsidian's ![[image-name]] syntax. It is aimed at researchers, students, and knowledge workers who manage document collections.

How It Works

The pipeline processes PDFs by uploading them to the Mistral AI OCR API for text and image extraction. Extracted text is converted into Markdown, while images are saved separately and referenced within the Markdown using Obsidian-compatible wikilinks. Each processed PDF results in a dedicated output folder containing the Markdown file, raw OCR JSON cache, and extracted images, with original PDFs moved to a pdfs-done directory to prevent reprocessing.
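The output-folder step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the response shape (a `pages` list whose entries carry `markdown` plus `images` with `id` and `image_base64`), the file name `ocr_response.json`, the `images` subfolder, and the helper name `write_obsidian_note` are all hypothetical. The OCR call itself is left out.

```python
import base64
import json
import re
from pathlib import Path

# Matches standard Markdown image references such as ![alt](img-0.jpeg).
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def write_obsidian_note(ocr_response: dict, pdf_stem: str, out_root: Path) -> Path:
    """Write one folder per PDF: Markdown note, raw OCR JSON, and images.

    Hypothetical helper. `ocr_response` is assumed to look like
    {"pages": [{"markdown": str, "images": [{"id": str, "image_base64": str}]}]}.
    """
    out_dir = out_root / pdf_stem
    image_dir = out_dir / "images"
    image_dir.mkdir(parents=True, exist_ok=True)

    # Keep the raw response so the PDF need not be sent to the API again.
    (out_dir / "ocr_response.json").write_text(
        json.dumps(ocr_response), encoding="utf-8"
    )

    page_texts = []
    for page in ocr_response.get("pages", []):
        for image in page.get("images", []):
            # Drop a possible data-URI prefix ("data:image/jpeg;base64,...").
            payload = image["image_base64"].split(",", 1)[-1]
            (image_dir / image["id"]).write_bytes(base64.b64decode(payload))
        # Rewrite ![alt](name) into the Obsidian embed form ![[name]].
        page_texts.append(
            MD_IMAGE.sub(lambda m: f"![[{m.group(1)}]]", page["markdown"])
        )

    note_path = out_dir / f"{pdf_stem}.md"
    note_path.write_text("\n\n".join(page_texts), encoding="utf-8")
    return note_path
```

Passing the parsed response in as a plain dict keeps the file-writing logic independent of whichever client library produced it.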

Quick Start & Requirements

  • Hosted Web App: Accessible at https://markdownify.up.railway.app/. Note: server load may affect availability.
  • Local Web App:
    • Install: pip install -r requirements.txt
    • Run: python app.py (accessible at http://localhost:5000/)
    • Prerequisites: Python, project dependencies (e.g., Flask, Mistral AI client).
  • Jupyter Notebook:
    • Install: pip install mistralai jupyter python-dotenv
    • Prerequisites: Python 3.9+, Mistral AI API key (free), Jupyter Notebook environment.
    • Setup: Configure MISTRAL_API_KEY via .env file or environment variable. Place PDFs in pdfs_to_process/ and run the pdf-markdown-ocr.ipynb notebook.
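The notebook setup step can be pictured with a small sketch like the one below. It is only an assumption about how the key lookup might work: the `read_dotenv` parser is a stdlib stand-in for python-dotenv, and the helper names `get_api_key` and `pending_pdfs` are hypothetical. The environment variable takes precedence over the .env file here, which matches python-dotenv's default of not overriding existing variables.

```python
import os
from pathlib import Path


def read_dotenv(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; a stdlib stand-in for python-dotenv."""
    values = {}
    if not path.exists():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(dotenv_path: Path = Path(".env")) -> str:
    """Return MISTRAL_API_KEY from the environment, else from the .env file."""
    key = os.environ.get("MISTRAL_API_KEY") or read_dotenv(dotenv_path).get(
        "MISTRAL_API_KEY"
    )
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set (environment or .env file)")
    return key


def pending_pdfs(folder: Path = Path("pdfs_to_process")) -> list:
    """List the PDFs waiting in the input folder, sorted for a stable order."""
    return sorted(folder.glob("*.pdf"))
```

Failing early with a clear error when the key is missing avoids a less obvious authentication failure partway through a batch.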

Highlighted Details

  • Batch processing of multiple PDFs.
  • Preserves document hierarchy and extracts images.
  • Generates Obsidian-style ![[image-name]] wikilinks for images.
  • Includes OCR caching via JSON response files.
  • Offers hosted web app, local web app, and Jupyter Notebook execution modes.
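Batch processing with a JSON cache could look something like the sketch below. It is not taken from the repository: the cache file name `ocr_response.json`, the function names, and the `run_ocr` callable (a placeholder for whatever actually calls the OCR API) are assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def ocr_with_cache(pdf_path: Path, out_root: Path,
                   run_ocr: Callable[[Path], dict]) -> dict:
    """Return the cached OCR JSON if present, else call run_ocr and store it."""
    cache_file = out_root / pdf_path.stem / "ocr_response.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    response = run_ocr(pdf_path)  # hypothetical: performs the real API request
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(response), encoding="utf-8")
    return response


def process_batch(input_dir: Path, out_root: Path, done_dir: Path,
                  run_ocr: Callable[[Path], dict]) -> list:
    """OCR every PDF in input_dir, then move it aside so it is not redone."""
    done_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        ocr_with_cache(pdf_path, out_root, run_ocr)
        shutil.move(str(pdf_path), str(done_dir / pdf_path.name))
        processed.append(pdf_path.name)
    return processed
```

Keying the cache on the PDF's file stem means a rerun after a crash reuses any response already on disk instead of sending the same document again.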

Maintenance & Community

Contributions for improving Obsidian compatibility are welcomed. No specific community channels (Discord/Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The license is not explicitly stated in the provided README. Compatibility requires Obsidian to be configured to support ![[image-name]] style links; deviations may necessitate script modifications.
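For vaults or tools that expect standard Markdown image links rather than ![[image-name]] embeds, one possible adjustment is a post-processing pass over the generated notes. The sketch below is an assumption, not part of the project: the function name `wikilinks_to_markdown` and the `images` folder default are hypothetical.

```python
import re
from urllib.parse import quote

# Matches ![[name]] and ![[name|300]] (the optional part after "|" is dropped).
WIKI_IMAGE = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def wikilinks_to_markdown(text: str, image_dir: str = "images") -> str:
    """Rewrite Obsidian image embeds as standard Markdown image links."""
    def repl(match: re.Match) -> str:
        name = match.group(1).strip()
        # Percent-encode spaces and similar characters; "/" is kept as-is.
        target = quote(f"{image_dir}/{name}")
        return f"![{name}]({target})"

    return WIKI_IMAGE.sub(repl, text)
```

For example, `wikilinks_to_markdown("See ![[img-0.jpeg]]")` yields `See ![img-0.jpeg](images/img-0.jpeg)`.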

Limitations & Caveats

The hosted web application may experience availability issues due to high server load. The image linking functionality is dependent on specific Obsidian configuration, potentially requiring user adjustments for optimal integration.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days
