Logics-Parsing by alibaba

Vision-Language Model for end-to-end complex document parsing

Created 1 month ago
651 stars

Top 51.4% on SourcePulse

Project Summary

Summary Logics-Parsing is an end-to-end document parsing model built for complex layouts and STEM content. It starts from a general Vision-Language Model (VLM) and trains it further with Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), giving a single-model pipeline for extracting structured information from documents. The model recognizes and structures difficult elements such as scientific formulas and chemical structures, the latter being exportable to the standard SMILES format. It outputs categorized HTML with bounding boxes and OCR text for each block, and filters out irrelevant content such as headers and footers.

How It Works The core architecture is a general VLM further trained with SFT and RL. Because one model handles the whole task, the pipeline is unified and end-to-end, removing the need for multi-stage document analysis workflows. The model recognizes and structures scientific formulas and chemical structures. The output is an HTML representation in which each content block is tagged with its category, bounding box coordinates, and OCR text, while extraneous elements like headers and footers are filtered out automatically.
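To make the output description concrete, here is a minimal sketch of reading such HTML back into structured records. The summary does not give the exact markup, so the sketch assumes each block is an element whose class attribute holds the category and whose data-bbox attribute holds four space- or comma-separated coordinates; both attribute conventions, and the names Block and parse_blocks, are illustrative assumptions rather than the project's documented schema.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class Block:
    category: str   # assumed to come from the class attribute
    bbox: tuple     # assumed to come from a data-bbox attribute
    text: str       # text found inside the block


class _BlockParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None  # block currently being filled, if any
        self._depth = 0       # nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        raw_bbox = attrs.get("data-bbox")
        if self._current is None and raw_bbox:
            bbox = tuple(float(v) for v in raw_bbox.replace(",", " ").split())
            block = Block(attrs.get("class") or "unknown", bbox, "")
            if tag in VOID_TAGS:
                self.blocks.append(block)  # e.g. an <img> block with no text
            else:
                self._current, self._depth = block, 1
        elif self._current is not None and tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current.text = self._current.text.strip()
            self.blocks.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.text += data


def parse_blocks(html):
    """Return the top-level tagged blocks found in an HTML string."""
    parser = _BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, parse_blocks('<div class="formula" data-bbox="10 100 300 160">E = mc^2</div>') would yield one Block with category "formula" under these assumptions. If the real output uses different attribute names, only the two attrs.get lookups need to change.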

Quick Start & Requirements Installation involves creating a Conda environment with Python 3.10, activating it, and then running pip install -r requirement.txt. Model weights can be downloaded from Modelscope or Hugging Face repositories; users must first install the respective client libraries (modelscope or huggingface_hub) and execute a provided download script. Inference is performed via a Python script, requiring specification of the input image path, desired output path, and the local path to the downloaded model weights.
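The download and inference steps could be scripted roughly as follows. This is a sketch under stated assumptions: snapshot_download is the real huggingface_hub function, but the repository ID is left as a parameter because the summary does not name it, and the script name inference.py and the flags --image_path, --output_path, and --model_path are hypothetical stand-ins for whatever the project's README actually specifies.

```python
import sys


def download_weights(repo_id, local_dir):
    """Fetch model weights from Hugging Face into local_dir.

    repo_id is not given in the summary, so the caller must supply it.
    Requires `pip install huggingface_hub`; imported lazily so the rest
    of this module works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def build_inference_command(image_path, output_path, model_path,
                            script="inference.py"):
    """Assemble the inference command as an argument list.

    The script name and the three flag names are assumptions made for
    illustration; check the project's README for the real ones.
    """
    return [
        sys.executable, script,
        "--image_path", image_path,
        "--output_path", output_path,
        "--model_path", model_path,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the activated Conda environment. Building the command as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.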

Highlighted Details

  • Achieves state-of-the-art performance on an in-house benchmark designed to comprehensively evaluate complex-layout document parsing and STEM content comprehension.
  • Generates structured HTML output that tags each content block (paragraphs, tables, figures, formulas) with its category, bounding box coordinates, and OCR text.
  • Accurately recognizes scientific formulas and chemical structures, with the capability to export chemical structures into the standard SMILES format.
  • Uses a single-model architecture, so the processing pipeline is end-to-end and deployment and inference workflows are simpler.

Maintenance & Community The project acknowledges inspiration and references from open-source initiatives including Qwen2.5-VL, OmniDocBench, and Mathpix. The provided README does not contain information regarding specific community channels (e.g., Discord, Slack), active contributors, or detailed maintenance status.

Licensing & Compatibility The README does not explicitly state the software license under which Logics-Parsing is distributed. Potential adopters should carefully investigate and confirm licensing terms, particularly concerning commercial use or integration into proprietary systems.

Limitations & Caveats Performance claims rest on an in-house benchmark; the README does not describe its composition, scope, or public availability. No other explicit limitations, such as unsupported platforms, known bugs, or alpha/beta status, are mentioned.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 19
  • Star History: 656 stars in the last 30 days
