Vision-Language Model for end-to-end complex document parsing
Summary
Logics-Parsing is an end-to-end document parsing model built for complex layouts and STEM content. It starts from a general Vision-Language Model (VLM) and is trained further with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), so a single model handles the whole pipeline of extracting structured information from a document. It recognizes and structures difficult elements such as scientific formulas and chemical structures, and chemical structures can be exported in the standard SMILES format. The output is categorized HTML with bounding boxes and OCR text for each block, with irrelevant content filtered out.
How It Works
The core architecture is a general VLM trained further with SFT and RL. Because one model processes the page end to end, no multi-stage document analysis workflow (for example, separate layout detection, OCR, and formula recognition stages) is needed. The output is an HTML representation in which each content block is tagged with its category, bounding box coordinates, and OCR text. Extraneous elements such as headers and footers are filtered out automatically.
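The tagged-HTML output described above can be consumed with a small parser. The sketch below is illustrative only: the summary does not give the actual schema, so the `div` elements, the `class` attribute used for the category, the `data-bbox` attribute, and the sample content are all hypothetical stand-ins. It also assumes blocks are not nested.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Block:
    category: str  # e.g. "paragraph", "formula", "chemistry" (hypothetical names)
    bbox: tuple    # (x1, y1, x2, y2); the coordinate convention is an assumption
    text: str      # OCR text, or a SMILES string for a chemical structure


class BlockParser(HTMLParser):
    """Collects <div> blocks that carry a (hypothetical) data-bbox attribute."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None  # block being read, or None when outside a block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "data-bbox" in attrs and self._current is None:
            bbox = tuple(int(v) for v in attrs["data-bbox"].split())
            self._current = Block(attrs.get("class", "unknown"), bbox, "")

    def handle_data(self, data):
        # Text outside a block (such as whitespace between divs) is ignored.
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self._current.text = self._current.text.strip()
            self.blocks.append(self._current)
            self._current = None


def parse_blocks(html):
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical sample output; "CCO" is the SMILES string for ethanol.
SAMPLE = """
<div class="paragraph" data-bbox="40 60 560 120">Ethanol is a simple alcohol.</div>
<div class="chemistry" data-bbox="40 140 300 260">CCO</div>
<div class="formula" data-bbox="40 280 400 330">E = mc^2</div>
"""

blocks = parse_blocks(SAMPLE)
for block in blocks:
    print(block.category, block.bbox, block.text)
```

Keeping category, coordinates, and text together in one record makes it straightforward to filter by block type, for instance to collect only the SMILES strings. The attribute names would need to be adjusted to whatever the model actually emits.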
Quick Start & Requirements
Installation involves creating a Conda environment with Python 3.10, activating it, and then running `pip install -r requirement.txt`. Model weights can be downloaded from the Modelscope or Hugging Face repositories; users must first install the respective client library (`modelscope` or `huggingface_hub`) and then run the provided download script. Inference is performed via a Python script that takes the input image path, the desired output path, and the local path to the downloaded model weights.
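Before running the download and inference scripts, it can help to confirm that the environment matches the setup just described. The following is a minimal sketch, not part of the project: `check_environment` and `download_from_hugging_face` are hypothetical helper names, and the repository id is left as a caller-supplied placeholder because this summary does not state it. The only third-party call used is `snapshot_download` from `huggingface_hub`, imported lazily so the check itself runs without it.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 10),
                      clients=("modelscope", "huggingface_hub")):
    """Report whether the interpreter version and a download client are present."""
    available = [name for name in clients
                 if importlib.util.find_spec(name) is not None]
    return {
        "python_ok": sys.version_info[:2] == tuple(required_python),
        "available_clients": available,
    }


def download_from_hugging_face(repo_id, local_dir):
    """Fetch a model snapshot into local_dir.

    repo_id is a placeholder supplied by the caller; the real repository
    name must be taken from the project's README.
    """
    from huggingface_hub import snapshot_download  # lazy import, optional dependency
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    report = check_environment()
    print("Python 3.10:", report["python_ok"])
    print("Download clients found:", report["available_clients"] or "none")
```

The project's own download script remains the reference path; this helper only shows how the two prerequisites (Python 3.10 and one of the client libraries) could be verified up front.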
Highlighted Details
Maintenance & Community
The project credits Qwen2.5-VL, OmniDocBench, and Mathpix as sources of inspiration and reference. The README does not mention community channels (e.g., Discord, Slack), active contributors, or maintenance status.
Licensing & Compatibility
The README does not state the license under which Logics-Parsing is distributed. Potential adopters should confirm the licensing terms before commercial use or integration into proprietary systems.
Limitations & Caveats
Performance claims rest on an in-house benchmark; the README does not describe its composition, scope, or public availability. No other limitations, such as unsupported platforms, known bugs, or alpha/beta status, are mentioned.