Parse PDFs into markdown using vision LLMs
Vision Parse is a Python library designed to convert PDF documents into markdown format using Vision Language Models (VLMs). It targets developers and researchers needing to extract structured content, including text, tables, and LaTeX equations, from scanned or complex PDFs, offering a streamlined approach with support for multiple LLM providers and local model hosting.
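A minimal usage sketch based on the project's README; `VisionParser`, `convert_pdf`, and the keyword arguments shown are assumptions that may vary across versions, and the input path is hypothetical:

```python
# Hedged quick-start sketch: the VisionParser class, convert_pdf method,
# and these keyword arguments follow the project's README and may differ
# across versions of vision-parse.
try:
    from vision_parse import VisionParser
except ImportError:  # library not installed in this environment
    VisionParser = None

markdown_pages: list = []
if VisionParser is not None:
    parser = VisionParser(
        model_name="gpt-4o",       # any supported VLM, e.g. an Ollama-hosted model
        detailed_extraction=True,  # better handling of LaTeX and tables
        enable_concurrency=True,   # parse pages in parallel
    )
    # "document.pdf" is a placeholder input path.
    markdown_pages = parser.convert_pdf("document.pdf")
```

The result is expected to be one markdown string per page, which you can join or post-process as needed.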
How It Works
The library leverages VLMs to "read" PDF documents, treating pages as images. It supports various models like GPT-4o, Gemini, and Ollama-hosted models (LLaVA, Llama3.2-vision). Users can specify extraction detail levels, image handling modes (URL or base64), and concurrency for faster processing. The core advantage lies in its ability to preserve rich content like LaTeX and hyperlinks, converting them into markdown, and offering flexibility through API-based or local model integrations.
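The page-as-image pipeline described above can be sketched with a stubbed model call; `vlm_to_markdown` and `convert_pages` are illustrative names standing in for a real VLM request, not vision-parse's actual internals:

```python
import asyncio

# Minimal sketch of the page-as-image pipeline: each PDF page, rendered to
# an image, is sent to a VLM that transcribes it to markdown. The stub
# below stands in for a real model call (GPT-4o, Gemini, or an
# Ollama-hosted LLaVA); all names here are illustrative.
async def vlm_to_markdown(page_image: bytes, detailed: bool) -> str:
    prompt = (
        "Transcribe this page to markdown, preserving tables, LaTeX "
        "equations, and hyperlinks." if detailed
        else "Transcribe this page to markdown."
    )
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return f"<!-- page: {len(page_image)} bytes, prompt: {prompt[:24]}... -->"

async def convert_pages(pages: list[bytes], concurrency: int = 4) -> list[str]:
    # Bound parallelism with a semaphore; asyncio.gather preserves page order.
    sem = asyncio.Semaphore(concurrency)

    async def one_page(image: bytes) -> str:
        async with sem:
            return await vlm_to_markdown(image, detailed=True)

    return list(await asyncio.gather(*(one_page(p) for p in pages)))

markdown_pages = asyncio.run(convert_pages([b"page-1", b"page-2", b"page-3"]))
print(len(markdown_pages))  # one markdown string per page
```

Bounding concurrency with a semaphore, as sketched here, is a common way to parallelize per-page model calls without exceeding provider rate limits.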
Quick Start & Requirements
Install with pip install vision-parse, or use pip install 'vision-parse[all]' for the full set of optional provider dependencies.

Highlighted Details
Enable detailed_extraction for complex elements such as LaTeX equations and tables, and enable_concurrency for multi-page parallel processing.

Maintenance & Community
The project is authored by Arun Brahma. Contribution guidelines are available.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Ollama-hosted vision models may be slower and less accurate on complex documents than API-based models, and benchmark accuracy depends on the chosen VLM and parameter settings.