vision-parse by iamarunbrahma

Python library for parsing PDFs into markdown using vision LLMs

created 7 months ago
408 stars

Top 72.5% on sourcepulse

Project Summary

Vision Parse is a Python library designed to convert PDF documents into markdown format using Vision Language Models (VLMs). It targets developers and researchers needing to extract structured content, including text, tables, and LaTeX equations, from scanned or complex PDFs, offering a streamlined approach with support for multiple LLM providers and local model hosting.

How It Works

The library leverages VLMs to "read" PDF documents by treating each page as an image. It supports models such as GPT-4o, Gemini, and Ollama-hosted models (LLaVA, Llama3.2-vision). Users can specify the extraction detail level, the image handling mode (URL or base64), and concurrency for faster processing. Its core advantage is preserving rich content such as LaTeX equations and hyperlinks in the markdown output, while offering flexibility through either API-based or local model integrations.
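To make the page-as-image idea concrete, here is a stdlib-only sketch of how a rendered page is typically packaged for a vision LLM API in base64 mode. The `page_to_message` helper and the message shape are illustrative assumptions, not vision-parse's internals:

```python
import base64

def page_to_message(page_png: bytes, prompt: str, image_mode: str = "base64") -> dict:
    """Build one chat message carrying a rendered PDF page.

    Hypothetical helper showing the common vision-API payload shape;
    not the library's actual implementation.
    """
    if image_mode != "base64":
        raise NotImplementedError("only the base64 mode is sketched here")
    # Embed the page image as a data URL so no hosting is required.
    data_url = "data:image/png;base64," + base64.b64encode(page_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

msg = page_to_message(b"\x89PNG...", "Convert this page to markdown.")
```

The URL image mode would instead pass a hosted link, trading the larger base64 payload for an extra upload step.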

Quick Start & Requirements

  • Install: pip install vision-parse or pip install 'vision-parse[all]' for full dependencies.
  • Prerequisites: Python >= 3.9. Ollama is required for local models. API keys for OpenAI, Azure OpenAI, Gemini, or DeepSeek are needed for cloud-based models.
  • Setup: Basic setup is quick via pip. Local model setup requires Ollama installation and model download. API key configuration is straightforward.
  • Docs: Getting Started, Usage, Supported Models, Parameters.
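The setup steps above reduce to a few commands; the Ollama model tag and the API key value below are placeholders, not prescribed by the project:

```shell
pip install 'vision-parse[all]'    # library plus optional provider dependencies
ollama pull llama3.2-vision:11b    # only for local models (tag is an assumption)
export OPENAI_API_KEY="sk-..."     # only for cloud models (placeholder key)
```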

Highlighted Details

  • Achieves 92% accuracy on a benchmark dataset of ML papers using GPT-4o, outperforming MarkItDown (67%) and Nougat (79%).
  • Supports direct integration with OpenAI, Azure OpenAI, Google Gemini, and DeepSeek APIs.
  • Enables local, private, and offline processing via Ollama, with options for custom Ollama configurations.
  • Offers detailed_extraction for complex elements like LaTeX and tables, and enable_concurrency for multi-page parallel processing.

Maintenance & Community

The project is authored by Arun Brahma. Contribution guidelines are available.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Ollama's vision models may be slower and less accurate on complex documents compared to API-based models. Benchmark accuracy is dependent on the chosen VLM and parameter settings.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History: 51 stars in the last 90 days

