vision-parse by iamarunbrahma

Python library for parsing PDFs into markdown using vision LLMs

created 7 months ago
408 stars

Top 72.5% on sourcepulse

Project Summary

Vision Parse is a Python library designed to convert PDF documents into markdown format using Vision Language Models (VLMs). It targets developers and researchers needing to extract structured content, including text, tables, and LaTeX equations, from scanned or complex PDFs, offering a streamlined approach with support for multiple LLM providers and local model hosting.

How It Works

The library leverages VLMs to "read" PDF documents by treating each page as an image. It supports models such as GPT-4o, Gemini, and Ollama-hosted models (LLaVA, Llama3.2-vision). Users can specify the extraction detail level, the image handling mode (URL or base64), and concurrency for faster processing. Its core advantage is preserving rich content such as LaTeX equations and hyperlinks in the markdown output, while offering flexibility through either API-based or local model integrations.
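To make the page-as-image idea concrete, here is a stdlib-only sketch of how a rendered page is typically packaged for a vision LLM API in base64 mode. The `page_to_message` helper and the message shape are illustrative assumptions, not vision-parse's internals:

```python
import base64

def page_to_message(page_png: bytes, prompt: str, image_mode: str = "base64") -> dict:
    """Build one chat message carrying a rendered PDF page.

    Hypothetical helper showing the common vision-API payload shape;
    not the library's actual implementation.
    """
    if image_mode != "base64":
        raise NotImplementedError("only the base64 mode is sketched here")
    # Embed the page image as a data URL so no hosting is required.
    data_url = "data:image/png;base64," + base64.b64encode(page_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

msg = page_to_message(b"\x89PNG...", "Convert this page to markdown.")
```

The URL image mode would instead pass a hosted link, trading the larger base64 payload for an extra upload step.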

Quick Start & Requirements

  • Install: pip install vision-parse or pip install 'vision-parse[all]' for full dependencies.
  • Prerequisites: Python >= 3.9. Ollama is required for local models. API keys for OpenAI, Azure OpenAI, Gemini, or DeepSeek are needed for cloud-based models.
  • Setup: Basic setup is quick via pip. Local model setup requires Ollama installation and model download. API key configuration is straightforward.
  • Docs: Getting Started, Usage, Supported Models, Parameters.
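The setup steps above reduce to a few commands; the Ollama model tag and the API key value below are placeholders, not prescribed by the project:

```shell
pip install 'vision-parse[all]'    # library plus optional provider dependencies
ollama pull llama3.2-vision:11b    # only for local models (tag is an assumption)
export OPENAI_API_KEY="sk-..."     # only for cloud models (placeholder key)
```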

Highlighted Details

  • Achieves 92% accuracy on a benchmark dataset of ML papers using GPT-4o, outperforming MarkItDown (67%) and Nougat (79%).
  • Supports direct integration with OpenAI, Azure OpenAI, Google Gemini, and DeepSeek APIs.
  • Enables local, private, and offline processing via Ollama, with options for custom Ollama configurations.
  • Offers detailed_extraction for complex elements like LaTeX and tables, and enable_concurrency for multi-page parallel processing.

Maintenance & Community

The project is authored by Arun Brahma. Contribution guidelines are available.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Ollama's vision models may be slower and less accurate on complex documents compared to API-based models. Benchmark accuracy is dependent on the chosen VLM and parameter settings.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History: 51 stars in the last 90 days

