marker  by datalab-to

CLI tool for converting PDFs and other documents to Markdown, JSON, and HTML

created 1 year ago
26,877 stars

Top 1.5% on sourcepulse

Project Summary

Marker is a Python library and command-line tool for high-accuracy document conversion. It transforms PDF, PPTX, DOCX, and other formats into Markdown, JSON, or HTML. It targets researchers, developers, and power users who need to extract structured data, tables, equations, and code from documents, and it claims significant speed advantages over cloud services and other open-source tools.

How It Works

Marker uses a pipeline: models such as Surya handle text extraction and layout detection, followed by formatting and post-processing steps. Models are invoked only where needed, which keeps conversion fast without giving up accuracy. An optional "hybrid mode" adds an LLM (such as Gemini or Ollama) to handle complex elements like cross-page tables and inline math, improving accuracy on those elements.

Quick Start & Requirements

  • Install via pip: pip install marker-pdf
  • For full format support: pip install "marker-pdf[full]" (the quotes keep shells such as zsh from interpreting the brackets)
  • Requires Python 3.10+ and PyTorch.
  • GPU/MPS acceleration is supported and recommended.
  • Official docs: https://github.com/datalab-to/marker
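
As a rough illustration of what a first conversion might look like after installation, the sketch below builds and runs a single-file command from Python. The `marker_single` entry point and the `--output_dir`, `--output_format`, and `--use_llm` flags are assumptions that do not appear in this summary; they follow the project README as best recalled and should be checked against `marker_single --help`. The helper names `build_marker_command` and `convert` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Output formats named in the project summary.
SUPPORTED_FORMATS = {"markdown", "json", "html"}


def build_marker_command(pdf_path, output_dir, output_format="markdown", use_llm=False):
    """Assemble the argument list for one conversion (hypothetical helper).

    Assumption: the CLI entry point is `marker_single` and it accepts
    --output_dir, --output_format, and --use_llm. Verify with --help.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    cmd = [
        "marker_single",
        str(Path(pdf_path)),
        "--output_dir", str(Path(output_dir)),
        "--output_format", output_format,
    ]
    if use_llm:
        # Assumed flag for the LLM-assisted "hybrid mode".
        cmd.append("--use_llm")
    return cmd


def convert(pdf_path, output_dir, **options):
    """Run the conversion, failing early if the CLI is not installed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; try: pip install marker-pdf")
    subprocess.run(build_marker_command(pdf_path, output_dir, **options), check=True)
```

Calling `convert("paper.pdf", "out", output_format="json")` would then run the conversion, provided the assumed entry point exists on the PATH. Keeping command construction separate from execution makes the argument list easy to inspect before anything runs.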

Highlighted Details

  • Benchmarks favorably against LlamaParse and Mathpix, achieving an average heuristic score of 95.67% and an average LLM score of 4.23.
  • Projected throughput of 122 pages/second on an H100 GPU.
  • Hybrid mode with LLMs boosts table conversion accuracy to 0.907 (vs. 0.816 for marker alone).
  • Supports extensive customization via processors, renderers, and LLM services (Gemini, Ollama, OpenAI, Claude).

Maintenance & Community

  • The project is under active development, with frequent updates and ongoing community discussion.
  • A Discord server is available for discussing future development.

Licensing & Compatibility

  • Model weights are licensed under CC-BY-NC-SA-4.0.
  • Commercial use is permitted for organizations under $5M USD gross revenue and under $5M in lifetime VC/angel funding, provided they are not competitive with the Datalab API.
  • Commercial licenses are available for purchase.

Limitations & Caveats

  • Very complex layouts, such as deeply nested tables, may not convert correctly without LLM assistance.
  • Forms may not always render well.
  • Out-of-memory errors can occur with high worker counts; reducing workers or splitting large PDFs is advised.
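
One way to act on the advice about splitting large PDFs is to process a document in fixed-size page chunks. The sketch below only computes the chunk boundaries. The zero-based "start-end" string format is an assumption modeled on a `--page_range` style option and should be verified against the tool's own help output. The function name `page_ranges` is hypothetical.

```python
def page_ranges(total_pages, chunk_size):
    """Split a document into consecutive zero-based page ranges.

    Returns strings such as "0-3" (or "4" for a single page), a format
    assumed here to match a --page_range style option.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    ranges = []
    for start in range(0, total_pages, chunk_size):
        # Last page index in this chunk, clamped to the end of the document.
        end = min(start + chunk_size, total_pages) - 1
        ranges.append(f"{start}-{end}" if end > start else str(start))
    return ranges


if __name__ == "__main__":
    # A 10-page document in chunks of 4 pages.
    print(page_ranges(10, 4))  # ['0-3', '4-7', '8-9']
```

Each range can then be converted in a separate run, so peak memory depends on the chunk size and not on the full document length.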

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 13
  • Issues (30d): 25
  • Star history: 2,346 stars in the last 90 days
