OCRFlux  by chatdoc-com

Multimodal PDF to Markdown conversion toolkit

created 2 months ago
2,013 stars

Top 22.5% on sourcepulse

Project Summary

OCRFlux is a multimodal toolkit for converting PDFs to Markdown, targeting complex layouts, tables, and content that spans multiple pages. It aims to produce cleaner, more readable text than existing OCR tools, for researchers and power users digitizing documents.

How It Works

OCRFlux uses a 3-billion-parameter vision-language model (VLM) to process PDF pages and images. It handles complex document structures, including multi-column layouts, embedded figures, and intricate tables, while preserving a natural reading order. A key differentiator is native support for merging tables and paragraphs that span multiple pages, a feature not commonly found in open-source OCR solutions.
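To make the cross-page merging idea concrete, here is a minimal sketch of the final joining step only. It is not OCRFlux's implementation: the function name `merge_cross_page`, the `pages` structure, and the `continues` flags are all hypothetical, and the flags stand in for whatever the model's detection step would produce.

```python
from typing import List


def merge_cross_page(pages: List[List[str]], continues: List[bool]) -> List[str]:
    """Flatten per-page paragraphs, joining paragraphs split by a page break.

    pages[i] holds the paragraphs of page i in reading order.
    continues[i] is True when the last paragraph of page i carries on into
    the first paragraph of page i + 1 (hypothetical detector output).
    """
    if len(continues) != max(len(pages) - 1, 0):
        raise ValueError("need exactly one continuation flag per page boundary")

    merged: List[str] = []
    join_next = False  # True when the next page's first paragraph extends merged[-1]
    for index, paragraphs in enumerate(pages):
        for position, text in enumerate(paragraphs):
            if position == 0 and join_next and merged:
                # Stitch the two halves together with a single space.
                merged[-1] = merged[-1].rstrip() + " " + text.lstrip()
            else:
                merged.append(text)
        # Look up the flag for the boundary after this page, if there is one.
        join_next = index < len(continues) and continues[index]
    return merged


pages = [
    ["Intro.", "The model reads"],
    ["each page as an image.", "Next section."],
]
print(merge_cross_page(pages, [True]))
```

The sketch joins halves with a space and does not handle words hyphenated across the page break or tables, which would need row-level merging rather than string concatenation.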

Quick Start & Requirements

  • Installation: Requires a clean Python 3.11 conda environment. From a local checkout of the repository, install via pip: pip install -e . --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
  • Prerequisites: NVIDIA GPU with at least 12 GB of VRAM (RTX 3090 or better recommended), 20 GB of disk space, poppler-utils, and specific fonts. The flashinfer wheel index in the install command (cu124, torch2.5) implies CUDA 12.4 and PyTorch 2.5.
  • Usage: Local execution requires downloading the OCRFlux-3B model. Command examples are provided for single files, directories, and API usage. Docker support is also available.
  • Links: Online Demo, GitHub
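Before installing, it can help to check the prerequisites listed above. The following is a rough pre-flight sketch using only the Python standard library. It is not part of OCRFlux; the function names and thresholds are assumptions taken from the figures above, and it checks for `pdftoppm` (a poppler-utils binary) and `nvidia-smi` (shipped with the NVIDIA driver) rather than for the fonts, which the summary does not name.

```python
import shutil
import subprocess
import sys
from typing import Dict, List

REQUIRED_PYTHON = (3, 11)   # from the installation note above
MIN_FREE_DISK_GIB = 20      # from the prerequisites above
MIN_VRAM_MIB = 12 * 1024    # 12 GB of VRAM, expressed in MiB


def gpu_vram_mib() -> List[int]:
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unusable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def preflight(path: str = ".") -> Dict[str, bool]:
    """Check each prerequisite and report pass/fail per item."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return {
        "python_3_11": sys.version_info[:2] == REQUIRED_PYTHON,
        "poppler_utils": shutil.which("pdftoppm") is not None,
        "gpu_12gb_vram": any(mib >= MIN_VRAM_MIB for mib in gpu_vram_mib()),
        "disk_20gb_free": free_gib >= MIN_FREE_DISK_GIB,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

A failed check is only a hint: for example, a GPU whose driver reports slightly under 12288 MiB would fail the VRAM test even though it is marketed as a 12 GB card.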

Highlighted Details

  • Achieves 0.967 Edit Distance Similarity (EDS) on the OCRFlux-bench-single benchmark, outperforming baselines by up to 0.187.
  • First open-source project to natively support cross-page table and paragraph merging.
  • Demonstrates high performance on table parsing (up to 0.912 TEDS for simple tables) and cross-page merging (0.986 F1 score for detection).
  • The 3B parameter model is designed to run on consumer GPUs like the RTX 3090.
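For readers unfamiliar with the EDS figure quoted above, the sketch below shows one common way to compute an edit-distance similarity: one minus the Levenshtein distance divided by the length of the longer string. This normalization is an assumption; the benchmark's exact definition may differ, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_distance_similarity(predicted: str, reference: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(predicted, reference) / longest


print(edit_distance_similarity("abcd", "abce"))  # one substitution in four characters
```

Under this definition, a score of 0.967 would correspond to roughly 3 character edits per 100 characters of the longer text.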

Maintenance & Community

Developed and maintained by the ChatDOC team.

Licensing & Compatibility

Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Requires an NVIDIA GPU with at least 12 GB of VRAM (RTX 3090 or better recommended), so it cannot be run locally without compatible hardware. The installation instructions call for a clean environment, which suggests the dependencies are sensitive to version conflicts.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 52

Star History

  • 2,039 stars in the last 90 days
