OCRFlux by chatdoc-com

Multimodal PDF to Markdown conversion toolkit

Created 8 months ago

2,484 stars

Top 18.3% on SourcePulse

Project Summary

OCRFlux is a multimodal toolkit for PDF-to-Markdown conversion that targets complex layouts, tables, and content that spans page breaks. It aims to produce cleaner, more readable output than existing OCR tools for researchers and power users digitizing documents.

How It Works

OCRFlux uses a 3-billion-parameter vision-language model (VLM) to process PDF pages and images. Its core strength is handling complex document structures, including multi-column layouts, embedded figures, and intricate tables, while preserving a natural reading order. A key differentiator is native support for merging tables and paragraphs that span multiple pages, a feature rarely found in open-source OCR tools.
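
To make the cross-page merging idea concrete, here is a minimal sketch. It is not OCRFlux's actual implementation. It assumes each page has already been converted to a list of Markdown blocks and that some detector (in OCRFlux, the model itself) has flagged which pages end with a paragraph that continues on the next page. The names merge_cross_page_paragraphs and continues_on_next are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def merge_cross_page_paragraphs(
    pages: List[List[str]], continues_on_next: Set[int]
) -> List[str]:
    """Flatten per-page blocks, joining paragraphs that span a page break.

    pages: one list of Markdown blocks per page, in reading order.
    continues_on_next: indices of pages whose last block continues as the
        first block of the following page (hypothetical detector output).
    """
    merged: List[str] = []
    join_next = False  # True when the previous page ended mid-paragraph
    for page_index, blocks in enumerate(pages):
        for block_index, block in enumerate(blocks):
            if join_next and block_index == 0 and merged:
                # Glue the continuation onto the paragraph from the previous page.
                merged[-1] = merged[-1].rstrip() + " " + block.lstrip()
            else:
                merged.append(block)
        # Only carry the flag forward if this page actually had content.
        join_next = page_index in continues_on_next and bool(blocks)
    return merged


pages = [
    ["# Title", "The model reads each page and"],
    ["emits Markdown.", "Next paragraph."],
]
print(merge_cross_page_paragraphs(pages, continues_on_next={0}))
```

This sketch only concatenates paragraphs. Tables that span pages would need structure-aware merging (for example, reconciling headers and column counts), which it does not attempt.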

Quick Start & Requirements

Installation: Requires a clean Python 3.11 conda environment. From a local clone of the repository, install via pip: pip install -e . --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
Prerequisites: NVIDIA GPU (RTX 3090 or better recommended) with at least 12GB VRAM, 20GB disk space, poppler-utils, and specific fonts. CUDA 12.4 and PyTorch 2.5 are implied by the flashinfer wheel URL (cu124/torch2.5).
Usage: Local execution requires downloading the OCRFlux-3B model. Command examples are provided for single files, directories, and API usage. Docker support is also available.
Links: Online Demo, GitHub
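
The prerequisites above can be checked before installation with a short script. The following is a sketch under stated assumptions, not part of OCRFlux: the function names are hypothetical, it treats the presence of pdftoppm on the PATH as evidence that poppler-utils is installed, it reads GPU memory through the standard nvidia-smi query interface, and it does not check fonts.

```python
import shutil
import subprocess
import sys
from typing import Dict, List, Optional


def gpu_memory_mib() -> Optional[List[int]]:
    """Return total memory (MiB) per NVIDIA GPU, or None if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def check_prerequisites(path: str = ".") -> Dict[str, bool]:
    """Check the requirements listed in the summary (hypothetical helper)."""
    gpus = gpu_memory_mib()
    return {
        "python_3_11": sys.version_info[:2] == (3, 11),
        # pdftoppm ships with poppler-utils.
        "poppler_utils": shutil.which("pdftoppm") is not None,
        "disk_20gb_free": shutil.disk_usage(path).free >= 20 * 1024**3,
        "gpu_12gb_vram": bool(gpus) and max(gpus) >= 12 * 1024,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A failed check is a hint to investigate, not proof that installation will fail. For example, a GPU may be present while nvidia-smi is not on the PATH.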

Highlighted Details

Achieves 0.967 Edit Distance Similarity (EDS) on the OCRFlux-bench-single benchmark, outperforming baselines by up to 0.187.
First open-source project to natively support cross-page table and paragraph merging.
Reports strong results on table parsing (up to 0.912 TEDS, Tree-Edit-Distance-based Similarity, for simple tables) and on detecting cross-page merges (0.986 F1 score).
The 3B parameter model is designed to run on consumer GPUs like the RTX 3090.
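
For readers unfamiliar with the metrics quoted above, the sketch below shows one common way to compute an edit-distance similarity and an F1 score. The summary does not state the benchmark's exact normalization, so dividing the Levenshtein distance by the length of the longer string is an assumption, and the function names are illustrative rather than taken from OCRFlux.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def edit_distance_similarity(pred: str, ref: str) -> float:
    """1.0 for identical strings, lower as more edits are needed.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(round(edit_distance_similarity("kitten", "sitting"), 3))
print(round(f1_score(tp=9, fp=1, fn=1), 3))
```

TEDS is not sketched here because it compares table structure as trees rather than as flat strings.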

Maintenance & Community

Developed and maintained by the ChatDOC team.

Licensing & Compatibility

Licensed under Apache 2.0, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Requires an NVIDIA GPU with at least 12GB of VRAM (RTX 3090 or better recommended), so it is not usable without compatible hardware. The installation instructions call for a clean environment, which suggests the dependency set is sensitive to version conflicts.
