vision-parse by iamarunbrahma

Python library for parsing PDFs into markdown using vision LLMs

Created 9 months ago
427 stars

Top 69.2% on SourcePulse

Project Summary

Vision Parse is a Python library that converts PDF documents into markdown using Vision Language Models (VLMs). It targets developers and researchers who need to extract structured content (text, tables, and LaTeX equations) from scanned or complex PDFs. It supports multiple LLM providers as well as locally hosted models.

How It Works

The library uses VLMs to read PDF documents by treating each page as an image. It supports hosted models such as GPT-4o and Gemini, as well as Ollama-hosted models (LLaVA, Llama3.2-vision). Users can set the level of extraction detail, the image handling mode (URL or base64), and whether pages are processed concurrently for speed. Its main advantage is that it preserves rich content such as LaTeX and hyperlinks when converting to markdown, and it works with either API-based or local models.
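The page-by-page flow described above can be illustrated with a minimal sketch. This is not the library's implementation: `fake_vlm_transcribe` is a hypothetical stand-in for a real vision-model call, and `pages_to_markdown`, `concurrent`, and `max_workers` are names invented here. The sketch only shows how page images can be transcribed in parallel while the output keeps page order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_vlm_transcribe(page_image: bytes) -> str:
    """Hypothetical stand-in for a vision-model call; returns markdown for one page."""
    return f"## Page ({len(page_image)} bytes)\n\n(transcribed content)"


def pages_to_markdown(
    page_images: List[bytes],
    transcribe: Callable[[bytes], str] = fake_vlm_transcribe,
    concurrent: bool = True,
    max_workers: int = 4,
) -> List[str]:
    """Transcribe each page image to markdown, keeping page order."""
    if not concurrent:
        # Sequential path: one model call at a time.
        return [transcribe(img) for img in page_images]
    # Executor.map yields results in input order even if calls finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe, page_images))


if __name__ == "__main__":
    # Dummy byte strings stand in for rendered page images.
    pages = [bytes(n) for n in (10, 20, 30)]
    for markdown in pages_to_markdown(pages):
        print(markdown.splitlines()[0])
```

A thread pool suits this kind of workload because each call mostly waits on a network or model response. A real pipeline would likely also need rate limiting and error handling per page.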

Quick Start & Requirements

  • Install: pip install vision-parse, or pip install 'vision-parse[all]' to include all optional dependencies.
  • Prerequisites: Python >= 3.9. Ollama is required for local models. API keys for OpenAI, Azure OpenAI, Gemini, or DeepSeek are needed for cloud-based models.
  • Setup: Install with pip. Local models additionally require installing Ollama and downloading the model; cloud models require configuring the provider's API key.
  • Docs: Getting Started, Usage, Supported Models, Parameters.
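As a rough illustration of what usage might look like after installation, here is a hedged sketch. The import path `vision_parse`, the class `VisionParser`, the method `convert_pdf`, and the keyword arguments `model_name` and `api_key` are assumptions about the package's interface and should be checked against the project's Usage and Parameters docs. `detailed_extraction` and `enable_concurrency` are the option names mentioned on this page. The helper names `parser_options` and `pdf_to_markdown` are made up for this example.

```python
import os
from typing import List, Optional


def parser_options(model_name: str, api_key: Optional[str] = None) -> dict:
    """Collect keyword arguments for the parser (argument names are assumptions)."""
    options = {
        "model_name": model_name,
        "detailed_extraction": True,  # option named on this page
        "enable_concurrency": True,   # option named on this page
    }
    if api_key:
        # Only cloud-hosted models need a key; local Ollama models do not.
        options["api_key"] = api_key
    return options


def pdf_to_markdown(pdf_path: str, model_name: str = "gpt-4o") -> List[str]:
    """Convert a PDF to markdown, assumed to return one string per page."""
    # Imported lazily so this file loads even when vision-parse is not installed.
    from vision_parse import VisionParser  # assumed import path and class name

    api_key = os.environ.get("OPENAI_API_KEY")
    parser = VisionParser(**parser_options(model_name, api_key))
    return parser.convert_pdf(pdf_path)  # assumed method name
```

For a local model, the same helper could be called with an Ollama model tag such as `llama3.2-vision:11b` and no API key, provided Ollama is running and the model has been downloaded.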

Highlighted Details

  • Achieves 92% accuracy on a benchmark dataset of ML papers using GPT-4o, outperforming MarkItDown (67%) and Nougat (79%).
  • Supports direct integration with OpenAI, Azure OpenAI, Google Gemini, and DeepSeek APIs.
  • Enables local, private, and offline processing via Ollama, with options for custom Ollama configurations.
  • Offers detailed_extraction for complex elements like LaTeX and tables, and enable_concurrency for multi-page parallel processing.

Maintenance & Community

The project is authored by Arun Brahma. Contribution guidelines are available.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Ollama's vision models may be slower and less accurate than API-based models on complex documents. Benchmark accuracy depends on the chosen VLM and parameter settings.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 4
  • Star History: 15 stars in the last 30 days
