PDF to Markdown conversion with OCR and image linking
Top 99.4% on SourcePulse
This project provides an automated workflow for converting PDF documents into structured Markdown files, tailored for use with Obsidian. It uses the Mistral AI OCR API to extract text and images, organizes them in a format that preserves the document hierarchy, and links images with Obsidian's ![[image-name]] syntax. It is aimed at researchers, students, and knowledge workers who manage document collections.
How It Works
The pipeline processes PDFs by uploading them to the Mistral AI OCR API for text and image extraction. Extracted text is converted into Markdown, while images are saved separately and referenced within the Markdown using Obsidian-compatible wikilinks. Each processed PDF results in a dedicated output folder containing the Markdown file, raw OCR JSON cache, and extracted images, with original PDFs moved to a pdfs-done directory to prevent reprocessing.
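The output step described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the ocr_response.json file name, and the shape of ocr_result (a "pages" list whose entries carry "markdown" and "images" with "id" and "image_base64" fields) are all assumptions made for the example, and the OCR call itself is left out.

```python
import base64
import json
import re
import shutil
from pathlib import Path

# Matches standard Markdown image links such as ![img-0.jpeg](img-0.jpeg).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def to_wikilinks(markdown: str) -> str:
    """Rewrite ![alt](name) image links as Obsidian ![[name]] wikilinks."""
    return IMAGE_LINK.sub(lambda m: f"![[{Path(m.group(1)).name}]]", markdown)


def write_outputs(pdf_path: Path, ocr_result: dict, out_root: Path, done_dir: Path) -> Path:
    """Write Markdown, a raw JSON cache, and images for one PDF (hypothetical helper)."""
    out_dir = out_root / pdf_path.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    # Cache the raw response so the PDF does not have to be sent for OCR again.
    cache_path = out_dir / "ocr_response.json"  # assumed file name
    cache_path.write_text(json.dumps(ocr_result, indent=2), encoding="utf-8")

    pages = []
    for page in ocr_result["pages"]:  # assumed response shape
        for image in page.get("images", []):
            # Assumed: base64 payload, optionally prefixed with a data URI header.
            payload = image["image_base64"].split(",", 1)[-1]
            (out_dir / image["id"]).write_bytes(base64.b64decode(payload))
        pages.append(to_wikilinks(page["markdown"]))

    (out_dir / f"{pdf_path.stem}.md").write_text("\n\n".join(pages), encoding="utf-8")

    # Move the source PDF so a later run skips it.
    done_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(done_dir / pdf_path.name))
    return out_dir
```

Keeping the raw JSON next to the Markdown means the Markdown can be regenerated later, for example with a different link style, without another OCR request.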
Quick Start & Requirements
Hosted web app: https://markdownify.up.railway.app/ (server load may affect availability).
Local web app: run pip install -r requirements.txt, then python app.py (accessible at http://localhost:5000/).
Notebook: run pip install mistralai jupyter python-dotenv and set MISTRAL_API_KEY via a .env file or environment variable. Place PDFs in pdfs_to_process/ and run the pdf-markdown-ocr.ipynb notebook.
Highlighted Details
Links images with Obsidian ![[image-name]] wikilinks.
Maintenance & Community
Contributions that improve Obsidian compatibility are welcome. The README provides no specific community channels (Discord/Slack) or roadmap details.
Licensing & Compatibility
The license is not explicitly stated in the provided README. Compatibility requires Obsidian to be configured to support ![[image-name]] style links; deviations may necessitate script modifications.
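For setups that do not use wikilinks, one possible adjustment is to post-process the generated Markdown instead of modifying the conversion script. The sketch below is a hypothetical helper, not part of the project: wikilinks_to_markdown and its image_dir parameter are names invented for this example, and it assumes images are referenced by file name only.

```python
import re
from urllib.parse import quote

# Matches Obsidian image embeds such as ![[img-0.jpeg]] or ![[img-0.jpeg|300]].
WIKILINK_IMAGE = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def wikilinks_to_markdown(text: str, image_dir: str = "") -> str:
    """Rewrite ![[name]] embeds as standard ![name](path) image links."""

    def replace(match: re.Match) -> str:
        name = match.group(1).strip()
        # Prefix an optional folder; any size hint after "|" is discarded.
        path = f"{image_dir.rstrip('/')}/{name}" if image_dir else name
        # Percent-encode spaces and similar characters so the link stays valid.
        return f"![{name}]({quote(path)})"

    return WIKILINK_IMAGE.sub(replace, text)


print(wikilinks_to_markdown("See ![[img-0.jpeg]] for the diagram."))
```

Plain note links such as [[note]] are left untouched, since the pattern only matches embeds that start with an exclamation mark.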
Limitations & Caveats
The hosted web application may experience availability issues due to high server load. The image linking functionality is dependent on specific Obsidian configuration, potentially requiring user adjustments for optimal integration.