Logics-Parsing by alibaba

Vision-Language Model for end-to-end complex document parsing

Created 6 months ago

900 stars

Top 40.2% on SourcePulse

Project Summary

Summary

Logics-Parsing is an end-to-end document parsing model built for complex layouts and STEM content. It is a general Vision-Language Model (VLM) fine-tuned with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), so a single model handles the whole pipeline for extracting structured information from documents. It recognizes and structures difficult elements such as scientific formulas and chemical structures, and chemical structures can be exported to the standard SMILES format. The output is categorized HTML with bounding boxes and OCR text for each block, with irrelevant content such as headers and footers filtered out.

How It Works

The core architecture is a general VLM further trained with SFT and RL. Because one model processes the page end to end, there is no need for a multi-stage document analysis workflow (for example, separate layout detection, OCR, and formula recognition steps). The model recognizes and structures scientific formulas and chemical structures. Its output is an HTML representation in which each content block is tagged with its category, bounding box coordinates, and OCR text; extraneous elements such as headers and footers are filtered out automatically.
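The output described above (one HTML element per content block, carrying a category, a bounding box, and OCR text) could be consumed along the following lines. This is only a sketch: the summary does not specify the exact markup, so the class attribute used as the category and the data-bbox attribute holding four space-separated integers are assumptions made for illustration, as are the names Block, BlockParser, and parse_blocks.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


@dataclass
class Block:
    category: str   # assumed to come from the element's class attribute
    bbox: tuple     # assumed layout: (x1, y1, x2, y2)
    text: str       # OCR text found inside the element


class BlockParser(HTMLParser):
    """Collects outermost elements that carry a (hypothetical) data-bbox attribute."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None  # (category, bbox, text parts) of the open block
        self._depth = 0       # nesting depth inside the open block

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._current is not None:
            # Nested element inside an open block: only track nesting depth.
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        if not attr.get("data-bbox"):
            return
        bbox = tuple(int(v) for v in attr["data-bbox"].split())
        category = attr.get("class") or tag
        if tag in VOID_TAGS:
            # A void element (e.g. an image) is a complete block with no text.
            self.blocks.append(Block(category, bbox, ""))
            return
        self._current = (category, bbox, [])
        self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            category, bbox, parts = self._current
            self.blocks.append(Block(category, bbox, "".join(parts).strip()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)


def parse_blocks(html):
    """Return the list of Block records found in an HTML string."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical sample in the assumed markup, not real model output.
sample = (
    '<div class="paragraph" data-bbox="10 20 300 60">Intro text</div>'
    '<div class="formula" data-bbox="10 80 300 120">E = mc^2</div>'
)
for block in parse_blocks(sample):
    print(block.category, block.bbox, block.text)
```

Only the standard library is used here, so the sketch can be adapted once the real attribute names are confirmed against actual model output.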

Quick Start & Requirements

Installation: create a Conda environment with Python 3.10, activate it, and run pip install -r requirement.txt. Model weights are available from Modelscope or Hugging Face; install the matching client library (modelscope or huggingface_hub) first, then run the provided download script. Inference is run through a Python script that takes three inputs: the path to the input image, the output path, and the local path to the downloaded model weights.
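The installation steps above can be sketched as follows. The environment name logics-parsing is a placeholder chosen for illustration, and the commands are printed rather than executed because conda activate only takes effect inside an interactive shell. The download_weights helper is also an assumption: it calls huggingface_hub.snapshot_download directly instead of the project's own download script, and the repository id must be supplied by the user since it is not given in this summary.

```python
import shlex


def setup_commands(env_name="logics-parsing", python_version="3.10"):
    """Build the environment setup commands described in the text.

    env_name is a hypothetical placeholder; requirement.txt is the file
    name given in the summary.
    """
    return [
        ["conda", "create", "-n", env_name, f"python={python_version}", "-y"],
        ["conda", "activate", env_name],
        ["pip", "install", "-r", "requirement.txt"],
    ]


def download_weights(repo_id, local_dir):
    """Fetch model weights from Hugging Face (stand-in for the provided script).

    Requires `pip install huggingface_hub`. repo_id is not stated in the
    summary and must be looked up in the project's README.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Print the commands so they can be pasted into a shell.
for cmd in setup_commands():
    print(shlex.join(cmd))
```

Inference is left out of the sketch because the script name and its argument names are not given here; consult the README for the exact invocation.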

Highlighted Details

Reports state-of-the-art performance on an in-house benchmark designed to evaluate complex-layout document parsing and STEM content comprehension.
Generates structured HTML output in which each content block (paragraph, table, figure, formula) is tagged with its category, bounding box coordinates, and OCR text.
Recognizes scientific formulas and chemical structures, and can export chemical structures to the standard SMILES format.
Uses a single-model, end-to-end pipeline, which simplifies deployment and inference.

Maintenance & Community

The project acknowledges inspiration and references from open-source initiatives including Qwen2.5-VL, OmniDocBench, and Mathpix. The README does not mention community channels (e.g., Discord, Slack), active contributors, or maintenance status.

Licensing & Compatibility

The README does not explicitly state the license under which Logics-Parsing is distributed. Potential adopters should confirm the licensing terms before commercial use or integration into proprietary systems.

Limitations & Caveats

Performance claims rest on an in-house benchmark; the README does not describe its composition, scope, or public availability, so the results cannot be independently checked from the README alone. No other limitations, such as unsupported platforms, known bugs, or alpha/beta status, are mentioned.

Explore Similar Projects

ferrules by AmineDiro

doctran by finic-ai

SmartResume by alibaba

spacy-layout by explosion

DocBank by doc-analysis

DeepSeek-OCR-Web by fufankeji

thepipe by emcf

OmniDocBench by opendatalab

MonkeyOCR by Yuliang-Liu

dots.ocr by rednote-hilab

Dolphin by bytedance

MinerU by opendatalab